MinHashLSH¶
MinHash LSH for approximate nearest neighbor search under Jaccard similarity on sparse/binary data.
Classes¶
culsh.MinHashLSH ¶
Locality sensitive hashing using min-wise independent permutations. This approximates Jaccard similarity between sets for ANN search.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_hash_tables
|
int
|
Number of hash tables (OR-amplification of the locality-sensitive family). More tables provide additional independent chances to find neighbors, improving recall at the cost of more false positives. Corresponds to 'b' in the amplified probability of collision (1-(1-s^r)^b), where s is the Jaccard similarity between two sets. |
required |
n_hashes
|
int
|
Number of hash functions per table (AND-amplification of the locality-sensitive family). More hashes require samples to agree on more hash bits, increasing precision at the cost of more false negatives. Corresponds to 'r' in the amplified probability of collision (1-(1-s^r)^b), where s is the Jaccard similarity between two sets. |
required |
seed
|
int
|
Random seed for reproducible hashes. If None (default), a random seed is used. |
None
|
Examples:
>>> import numpy as np
>>> import scipy.sparse
>>> from culsh import MinHashLSH
>>>
>>> # Create random sparse data
>>> X = scipy.sparse.random(10000, 1000, density=0.01, format='csr')
>>> Q = scipy.sparse.random(100, 1000, density=0.01, format='csr')
>>>
>>> # Fit model
>>> lsh = MinHashLSH(n_hash_tables=16, n_hashes=8)
>>> model = lsh.fit(X)
>>>
>>> # Query for candidates
>>> candidates = model.query(Q)
>>> indices = candidates.get_indices()
>>> counts = candidates.get_counts()
>>> offsets = candidates.get_offsets()
fit ¶
Fit the MinHashLSH model on input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
CSR matrix of shape (n_samples, n_features)
|
Sparse input matrix. Can be scipy or cupyx CSR matrix. Note the matrix is assumed to represent binary set membership (magnitude of data values are ignored). |
required |
Returns:
| Type | Description |
|---|---|
MinHashLSHModel
|
The fitted model containing the LSH index. |
fit_query ¶
Simultaneously fit and query the LSH index. This is more efficient than calling fit(X) followed by query(X) because it avoids a search step to find matching buckets. Note: input vectors are considered candidate neighbors of themselves.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
CSR matrix of shape (n_samples, n_features)
|
Sparse input matrix. Can be scipy or cupyx CSR matrix. |
required |
Returns:
| Type | Description |
|---|---|
Candidates
|
Query results containing candidate indices for each sample. |
culsh.MinHashLSHModel ¶
MinHashLSHModel(n_hash_tables: int, n_hashes: int, n_features: int, core: MinHashCore, index: MinHashIndex)
Model produced by MinHashLSH.fit() containing the fitted LSH index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_hash_tables
|
int
|
Number of hash tables. |
required |
n_hashes
|
int
|
Number of hash functions per hash table. |
required |
n_features
|
int
|
Number of features. |
required |
core
|
MinHashCore
|
Core MinHash object containing the fitted index. |
required |
index
|
MinHashIndex
|
Fitted index. |
required |
load
classmethod
¶
Load the MinHashLSH model from an npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to load the model from. |
required |
query ¶
Find candidate neighbors for the query vectors Q.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
Q
|
CSR matrix of shape (n_queries, n_features)
|
Sparse query matrix. Can be scipy or cupyx CSR matrix. |
required |
Returns:
| Type | Description |
|---|---|
Candidates
|
Query results containing candidate indices for each query. |
save ¶
Save the MinHashLSH model to an npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to save the model. |
required |