PStableLSH¶
p-Stable LSH for approximate nearest neighbor search under Euclidean distance.
Classes¶
culsh.PStableLSH ¶
PStableLSH(n_hash_tables: int, n_hashes: int, window_size: Union[int, Literal['auto']] = 'auto', seed: Optional[int] = None)
Locality sensitive hashing using p-stable distributions. This approximates Euclidean distance between vectors for ANN search.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_hash_tables
|
int
|
Number of hash tables (OR-amplification of the locality-sensitive family). More tables provide additional independent chances to find neighbors, improving recall at the cost of more false positives. Corresponds to 'b' in the amplified probability (1-(1-p_w^r)^b), where p_w is the probability of collision for window size w. |
required |
n_hashes
|
int
|
Number of hashes (random projections) per table (AND-amplification of the locality-sensitive family). More hashes require samples to agree on more hash bits, increasing precision at the cost of more false negatives. Corresponds to 'r' in the amplified probability (1-(1-p_w^r)^b), where p_w is the probability of collision for window size w. |
required |
window_size
|
int or auto
|
The quantization width for projections. Determines the resolution of the hash function by defining the physical size of the hash buckets. Larger window size increases the base collision probability. If "auto", the window size is estimated from data magnitude during fit() by sampling vectors and computing mean magnitude. |
'auto'
|
seed
|
int
|
Random seed for reproducible hashes. If None (default), a random seed is used. |
None
|
Examples:
>>> import numpy as np
>>> from culsh import PStableLSH
>>>
>>> # Create random data
>>> X = np.random.randn(10000, 128).astype(np.float32)
>>> Q = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit model
>>> lsh = PStableLSH(n_hash_tables=16, n_hashes=8, window_size=4)
>>> model = lsh.fit(X)
>>>
>>> # Query for candidates
>>> candidates = model.query(Q)
>>> indices = candidates.get_indices()
>>> counts = candidates.get_counts()
>>> offsets = candidates.get_offsets()
window_size
property
¶
Window size or 'auto' if estimated at fit time.
fit ¶
Fit the PStableLSH model on input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like of shape (n_samples, n_features)
|
Input vectors. Can be numpy or cupy array. |
required |
Returns:
| Type | Description |
|---|---|
PStableLSHModel
|
The fitted model containing the LSH index. |
fit_query ¶
Simultaneously fit and query the LSH index. This is more efficient than calling fit(X) followed by query(X) because it avoids a search step to find matching buckets. Note: input vectors are considered candidate neighbors of themselves.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like of shape (n_samples, n_features)
|
Input vectors to fit and query. Can be numpy or cupy array. |
required |
batch_size
|
int
|
If specified, process queries in batches of this size to reduce peak memory usage. Note that this will fall back to calling fit(X) followed by query(X) with batching. |
None
|
Returns:
| Type | Description |
|---|---|
Candidates
|
Query results containing candidate indices for each sample. |
culsh.PStableLSHModel ¶
PStableLSHModel(n_hash_tables: int, n_hashes: int, window_size: int, n_features: int, core: PSLSHCore, index: PSLSHIndex)
Model produced by PStableLSH.fit() containing the fitted PStableLSH index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_hash_tables
|
int
|
Number of hash tables. |
required |
n_hashes
|
int
|
Number of hashes per hash table. |
required |
n_features
|
int
|
Number of features. |
required |
window_size
|
int
|
Window size. |
required |
core
|
PSLSHCore
|
Core PStableLSH object containing the fitted index. |
required |
index
|
PSLSHIndex
|
Fitted index. |
required |
load
classmethod
¶
Load the PStableLSH model from an npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to load the model from. |
required |
query ¶
Find candidate neighbors for the query vectors Q.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
Q
|
array-like of shape (n_queries, n_features)
|
Query vectors. Can be numpy or cupy array. |
required |
batch_size
|
int
|
If specified, process queries in batches of this size to reduce peak memory usage. |
None
|
Returns:
| Type | Description |
|---|---|
Candidates
|
Query results containing candidate indices for each query. |
save ¶
Save the PStableLSH model to an npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to save the model. |
required |