PStableLSH¶

p-Stable LSH for approximate nearest neighbor search under Euclidean distance.

Classes¶

culsh.PStableLSH ¶

PStableLSH(n_hash_tables: int, n_hashes: int, window_size: Union[int, Literal['auto']] = 'auto', seed: Optional[int] = None)

Locality sensitive hashing using p-stable distributions. This approximates Euclidean distance between vectors for ANN search.

Parameters:

Name	Type	Description	Default
`n_hash_tables`	`int`	Number of hash tables (OR-amplification of the locality-sensitive family). More tables provide additional independent chances to find neighbors, improving recall at the cost of more false positives. Corresponds to 'b' in the amplified probability (1-(1-p_w^r)^b), where p_w is the probability of collision for window size w.	required
`n_hashes`	`int`	Number of hashes (random projections) per table (AND-amplification of the locality-sensitive family). More hashes require samples to agree on more hash bits, increasing precision at the cost of more false negatives. Corresponds to 'r' in the amplified probability (1-(1-p_w^r)^b), where p_w is the probability of collision for window size w.	required
`window_size`	`int or auto`	The quantization width for projections. Determines the resolution of the hash function by defining the physical size of the hash buckets. Larger window size increases the base collision probability. If "auto", the window size is estimated from data magnitude during fit() by sampling vectors and computing mean magnitude.	`'auto'`
`seed`	`int`	Random seed for reproducible hashes. If None (default), a random seed is used.	`None`

Examples:

>>> import numpy as np
>>> from culsh import PStableLSH
>>>
>>> # Create random data
>>> X = np.random.randn(10000, 128).astype(np.float32)
>>> Q = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit model
>>> lsh = PStableLSH(n_hash_tables=16, n_hashes=8, window_size=4)
>>> model = lsh.fit(X)
>>>
>>> # Query for candidates
>>> candidates = model.query(Q)
>>> indices = candidates.get_indices()
>>> counts = candidates.get_counts()
>>> offsets = candidates.get_offsets()

n_hash_tables `property` ¶

n_hash_tables: int

Number of hash tables.

n_hashes `property` ¶

n_hashes: int

Number of hash functions per hash table.

seed `property` ¶

seed: int

Random seed.

window_size `property` ¶

window_size: Union[int, Literal['auto']]

Window size or 'auto' if estimated at fit time.

fit ¶

fit(X: Union[ndarray, ndarray]) -> PStableLSHModel

Fit the PStableLSH model on input data.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Input vectors. Can be numpy or cupy array.	required

Returns:

Type	Description
`PStableLSHModel`	The fitted model containing the LSH index.

fit_query ¶

fit_query(X: Union[ndarray, ndarray], batch_size: Optional[int] = None) -> Candidates

Simultaneously fit and query the LSH index. This is more efficient than calling fit(X) followed by query(X) because it avoids a search step to find matching buckets. Note: input vectors are considered candidate neighbors of themselves.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Input vectors to fit and query. Can be numpy or cupy array.	required
`batch_size`	`int`	If specified, process queries in batches of this size to reduce peak memory usage. Note that this will fall back to calling fit(X) followed by query(X) with batching.	`None`

Returns:

Type	Description
`Candidates`	Query results containing candidate indices for each sample.

culsh.PStableLSHModel ¶

PStableLSHModel(n_hash_tables: int, n_hashes: int, window_size: int, n_features: int, core: PSLSHCore, index: PSLSHIndex)

Model produced by PStableLSH.fit() containing the fitted PStableLSH index.

Parameters:

Name	Type	Description	Default
`n_hash_tables`	`int`	Number of hash tables.	required
`n_hashes`	`int`	Number of hashes per hash table.	required
`n_features`	`int`	Number of features.	required
`window_size`	`int`	Window size.	required
`core`	`PSLSHCore`	Core PStableLSH object containing the fitted index.	required
`index`	`PSLSHIndex`	Fitted index.	required

index `property` ¶

index: PSLSHIndex

The fitted index.

n_features `property` ¶

n_features: int

Number of features.

n_hash_tables `property` ¶

n_hash_tables: int

Number of hash tables.

n_hashes `property` ¶

n_hashes: int

Number of hashes per hash table.

window_size `property` ¶

window_size: int

Window size.

load `classmethod` ¶

load(path: str) -> PStableLSHModel

Load the PStableLSH model from an npz file.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to load the model from.	required

query ¶

query(Q: Union[ndarray, ndarray], batch_size: Optional[int] = None) -> Candidates

Find candidate neighbors for the query vectors Q.

Parameters:

Name	Type	Description	Default
`Q`	`array-like of shape (n_queries, n_features)`	Query vectors. Can be numpy or cupy array.	required
`batch_size`	`int`	If specified, process queries in batches of this size to reduce peak memory usage.	`None`

Returns:

Type	Description
`Candidates`	Query results containing candidate indices for each query.

save ¶

save(path: str) -> None

Save the PStableLSH model to an npz file.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to save the model.	required

PStableLSH¶

Classes¶

culsh.PStableLSH ¶

n_hash_tables property ¶

n_hashes property ¶

seed property ¶

window_size property ¶

fit ¶

fit_query ¶

culsh.PStableLSHModel ¶

index property ¶

n_features property ¶

n_hash_tables property ¶

n_hashes property ¶

window_size property ¶

load classmethod ¶

query ¶

save ¶

n_hash_tables `property` ¶

n_hashes `property` ¶

seed `property` ¶

window_size `property` ¶

index `property` ¶

n_features `property` ¶

n_hash_tables `property` ¶

n_hashes `property` ¶

window_size `property` ¶

load `classmethod` ¶