RPLSH
Random Projection LSH for approximate nearest neighbor search under cosine similarity.
Classes
culsh.RPLSH
Locality sensitive hashing using random projections. This approximates cosine distance between vectors for ANN search.
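To make the idea concrete, the following is a minimal NumPy sketch of sign random projection hashing. It is not culsh's implementation: the helper `hash_signatures` and the variable `planes` are hypothetical names, and it assumes each hash bit is the sign of a dot product with a Gaussian random vector.

```python
import numpy as np

def hash_signatures(X, planes):
    """Return one integer bucket key per row, built from the sign pattern of X @ planes.T."""
    bits = (X @ planes.T) >= 0                    # (n_samples, n_hashes) booleans
    weights = 1 << np.arange(planes.shape[0])     # bit i contributes 2**i to the key
    return bits.astype(np.int64) @ weights

rng = np.random.default_rng(0)
planes = rng.standard_normal((8, 128))            # 8 random hyperplanes (hypothetical)
x = rng.standard_normal(128)
near = x + 0.05 * rng.standard_normal(128)        # small perturbation of x

keys = hash_signatures(np.stack([x, 2.0 * x, -x, near]), planes)

# Scaling a vector never changes its key, since only the direction matters.
same_direction = keys[0] == keys[1]
# Number of hash bits on which x and its perturbed copy disagree.
differing_bits = bin(int(keys[0] ^ keys[3])).count("1")
```

Because only signs are kept, the key depends on a vector's direction and not its length, which is why this family suits cosine similarity. Nearby directions tend to disagree on few bits, while opposite directions disagree on all of them.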
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_hash_tables | int | Number of hash tables (OR-amplification of the locality-sensitive family). More tables provide additional independent chances to find neighbors, improving recall at the cost of more false positives. Corresponds to 'b' in the amplified probability 1-(1-s^r)^b, where s is the probability that two vectors agree on a single hash bit. For random projections, s = 1 - θ/π, where θ is the angle between the vectors, so s increases with their cosine similarity. | required |
| n_hashes | int | Number of hashes (random projections) per table (AND-amplification of the locality-sensitive family). More hashes require samples to agree on more hash bits, increasing precision at the cost of more false negatives. Corresponds to 'r' in the amplified probability 1-(1-s^r)^b, with s defined as for n_hash_tables. | required |
| seed | int | Random seed for reproducible hashes. If None (default), a random seed is used. | None |
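The trade-off between the two amplification parameters can be explored numerically. The sketch below evaluates 1-(1-s^r)^b, taking s to be the single-bit agreement probability, which for sign random projections is 1 - θ/π for vectors at angle θ. The function names are hypothetical, and the mapping from cosine similarity to s is an assumption about how the hashes are drawn.

```python
import numpy as np

def bit_agreement(cos_sim):
    """Probability that two vectors fall on the same side of one random hyperplane."""
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    return 1.0 - theta / np.pi

def candidate_probability(s, n_hashes, n_hash_tables):
    """Probability of sharing a bucket in at least one table: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** n_hashes) ** n_hash_tables

for cos_sim in (0.5, 0.8, 0.95):
    s = bit_agreement(cos_sim)
    p = candidate_probability(s, n_hashes=8, n_hash_tables=16)
    print(f"cos={cos_sim:.2f}  s={s:.3f}  P(candidate)={p:.3f}")
```

Raising n_hashes lowers the curve everywhere (fewer false positives, more false negatives), while raising n_hash_tables lifts it, so the two are usually tuned together.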
Examples:
>>> import numpy as np
>>> from culsh import RPLSH
>>>
>>> # Create random data
>>> X = np.random.randn(10000, 128).astype(np.float32)
>>> Q = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit model
>>> lsh = RPLSH(n_hash_tables=16, n_hashes=8)
>>> model = lsh.fit(X)
>>>
>>> # Query for candidates
>>> candidates = model.query(Q)
>>> indices = candidates.get_indices()
>>> counts = candidates.get_counts()
>>> offsets = candidates.get_offsets()
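The layout of the three arrays is not spelled out here. A plausible reading, which is an assumption and not confirmed by this page, is a flat layout: `indices` concatenates the candidates of all queries and `counts[i]` gives how many of them belong to query i. Under that assumption, per-query candidate lists can be recovered as sketched below with synthetic stand-in arrays.

```python
import numpy as np

# Synthetic stand-ins for candidates.get_indices() and candidates.get_counts();
# the flat layout is an assumption, not documented behavior.
indices = np.array([4, 9, 2, 7, 7, 1, 3])
counts = np.array([2, 0, 5])   # query 0 has 2 candidates, query 1 none, query 2 five

# Split points are the running totals of counts, excluding the final total.
boundaries = np.cumsum(counts)[:-1]
per_query = np.split(indices, boundaries)

for i, cand in enumerate(per_query):
    print(f"query {i}: {cand.tolist()}")
```

If the arrays come back as cupy arrays, they would need to be moved to the host (or the same logic written with cupy) before using NumPy functions on them.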
fit
Fit the RPLSH model on input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Input vectors. Can be numpy or cupy array. | required |
Returns:
| Type | Description |
|---|---|
| RPLSHModel | The fitted model containing the LSH index. |
fit_query
Simultaneously fit and query the LSH index. This is more efficient than calling fit(X) followed by query(X) because it avoids a search step to find matching buckets. Note: input vectors are considered candidate neighbors of themselves.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Input vectors to fit and query. Can be numpy or cupy array. | required |
| batch_size | int | If specified, process queries in batches of this size to reduce peak memory usage. Note that this will fall back to calling fit(X) followed by query(X) with batching. | None |
Returns:
| Type | Description |
|---|---|
| Candidates | Query results containing candidate indices for each sample. |
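Since every input vector is returned as a candidate neighbor of itself, callers who only want other points may wish to filter those entries out. The sketch below shows one way to do that. It assumes the candidates come as a flat `indices` array plus a per-sample `counts` array, and that sample i of X has index i. The helper `drop_self_matches` is hypothetical and not part of culsh.

```python
import numpy as np

def drop_self_matches(indices, counts):
    """Remove entries where a sample lists itself; return filtered indices and counts."""
    # owners[j] is the sample whose candidate list contains indices[j].
    owners = np.repeat(np.arange(len(counts)), counts)
    keep = indices != owners
    new_counts = np.bincount(owners[keep], minlength=len(counts))
    return indices[keep], new_counts

# Synthetic example: sample 0 -> [0, 3], sample 1 -> [1], sample 2 -> [2, 0, 2]
indices = np.array([0, 3, 1, 2, 0, 2])
counts = np.array([2, 1, 3])

filtered, filtered_counts = drop_self_matches(indices, counts)
```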
culsh.RPLSHModel
Model produced by RPLSH.fit() containing the fitted LSH index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_hash_tables | int | Number of hash tables. | required |
| n_hashes | int | Number of hashes per hash table. | required |
| n_features | int | Number of features. | required |
| core | RPLSHCore | Core RPLSH object containing the fitted index. | required |
| index | Index | Fitted index. | required |
load (classmethod)
Load the RPLSH model from an npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to load the model from. | required |
query
Find candidate neighbors for the query vectors Q.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| Q | array-like of shape (n_queries, n_features) | Query vectors. Can be numpy or cupy array. | required |
| batch_size | int | If specified, process queries in batches of this size to reduce peak memory usage. | None |
Returns:
| Type | Description |
|---|---|
| Candidates | Query results containing candidate indices for each query. |
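LSH returns candidates, not ranked neighbors, so a common follow-up is to re-rank each query's candidates by exact cosine similarity. The sketch below illustrates that step for a single query. The helper `rerank_by_cosine` is hypothetical and not part of culsh, and it assumes all vectors are non-zero.

```python
import numpy as np

def rerank_by_cosine(X, q, candidate_ids, k=5):
    """Return the k candidates most similar to q, with their exact cosine similarities."""
    cand = X[candidate_ids]
    sims = (cand @ q) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]      # descending similarity
    return candidate_ids[order], sims[order]

# Tiny synthetic example: four indexed vectors and one query.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
q = np.array([1.0, 0.0])
candidate_ids = np.array([1, 2, 3, 0])  # stand-in for one query's candidate list

top_ids, top_sims = rerank_by_cosine(X, q, candidate_ids, k=2)
```

Only the candidate rows are touched, so the exact comparison costs far less than a brute-force scan when the candidate lists are short.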
save
Save the RPLSH model to an npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to save the model. | required |
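For readers unfamiliar with the npz format, it is NumPy's zip archive of named arrays. The round trip below uses plain NumPy to show what such a file holds. The key names and array contents are made up for illustration, since the arrays culsh actually stores are not documented on this page.

```python
import os
import tempfile
import numpy as np

# Hypothetical contents; the real keys written by RPLSHModel.save are an unknown here.
planes = np.random.default_rng(0).standard_normal((16 * 8, 128)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.npz")
    # Each keyword argument becomes one named array inside the archive.
    np.savez(path, planes=planes, n_hash_tables=16, n_hashes=8)

    with np.load(path) as archive:
        keys = sorted(archive.files)
        restored = archive["planes"]
        n_hash_tables = int(archive["n_hash_tables"])
```

Note that `np.savez` appends `.npz` to the filename when it is missing. Whether culsh's save does the same is not stated here.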