RPLSH¶

Random Projection LSH for approximate nearest neighbor search under cosine similarity.

Classes¶

culsh.RPLSH ¶

RPLSH(n_hash_tables: int, n_hashes: int, seed: Optional[int] = None)

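Locality-sensitive hashing (LSH) using random projections. Vectors separated by a small cosine distance are likely to land in the same hash bucket, which makes the scheme suitable for approximate nearest neighbor (ANN) search.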
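As a rough illustration of the idea (a sketch, not the library's implementation), a random-projection hash can be written in a few lines of NumPy: each hash bit records which side of a random hyperplane a vector falls on. The helper name `rp_hash_bits` and the Gaussian hyperplane normals are assumptions made for this example.

```python
import numpy as np

def rp_hash_bits(X, planes):
    """Return one bit per (vector, hyperplane): 1 if the projection is non-negative."""
    return (X @ planes.T >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
n_features, n_hashes = 128, 8
# Hypothetical hyperplane normals, one row per hash bit.
planes = rng.standard_normal((n_hashes, n_features))

x = rng.standard_normal(n_features)
scaled = 3.0 * x                                    # same direction, different length
near = x + 0.05 * rng.standard_normal(n_features)   # small perturbation of x
far = -x                                            # opposite direction

bits = rp_hash_bits(np.stack([x, scaled, near, far]), planes)
agree_scaled = int((bits[0] == bits[1]).sum())  # bits depend only on direction
agree_near = int((bits[0] == bits[2]).sum())    # usually most bits agree
agree_far = int((bits[0] == bits[3]).sum())     # every projection flips sign
print(agree_scaled, agree_near, agree_far)
```

Because only the sign of each projection is kept, the bits depend on a vector's direction and not its length, which is why the scheme tracks cosine similarity and not Euclidean distance.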

Parameters:

Name	Type	Description	Default
`n_hash_tables`	`int`	Number of hash tables (OR-amplification of the locality-sensitive family). More tables give additional independent chances to find neighbors, improving recall at the cost of more false positives. Corresponds to 'b' in the amplified collision probability 1 - (1 - s^r)^b, where s is the probability that two vectors agree on a single hash; for random projections, s = 1 - θ/π, with θ the angle between the two vectors.	required
`n_hashes`	`int`	Number of hashes (random projections) per table (AND-amplification of the locality-sensitive family). More hashes require samples to agree on more hash bits, improving precision at the cost of more false negatives. Corresponds to 'r' in the amplified collision probability 1 - (1 - s^r)^b, where s is the probability that two vectors agree on a single hash; for random projections, s = 1 - θ/π, with θ the angle between the two vectors.	required
`seed`	`int`	Random seed for reproducible hashes. If None (default), a random seed is used.	`None`
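To get a feel for how `n_hashes` (r) and `n_hash_tables` (b) trade off, the amplified probability can be evaluated directly. The sketch below treats s as the probability that two vectors agree on a single hash bit and uses the standard random-hyperplane relation s = 1 - θ/π; the helper names `bit_agreement` and `candidate_probability` are made up for this example and are not part of culsh.

```python
import math

def bit_agreement(cos_sim):
    """Probability that one random hyperplane puts two vectors on the same side."""
    theta = math.acos(max(-1.0, min(1.0, cos_sim)))  # angle between the vectors
    return 1.0 - theta / math.pi

def candidate_probability(s, n_hashes, n_hash_tables):
    """Probability of sharing a bucket in at least one table: 1 - (1 - s**r)**b."""
    return 1.0 - (1.0 - s ** n_hashes) ** n_hash_tables

for cos_sim in (0.0, 0.5, 0.9):
    s = bit_agreement(cos_sim)
    p = candidate_probability(s, n_hashes=8, n_hash_tables=16)
    print(f"cos={cos_sim:.1f}  s={s:.3f}  p={p:.3f}")
```

Raising `n_hashes` pushes the curve down (fewer, more precise candidates), while raising `n_hash_tables` pushes it up (more candidates, better recall), so the two are usually tuned together.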

Examples:

>>> import numpy as np
>>> from culsh import RPLSH
>>>
>>> # Create random data
>>> X = np.random.randn(10000, 128).astype(np.float32)
>>> Q = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit model
>>> lsh = RPLSH(n_hash_tables=16, n_hashes=8)
>>> model = lsh.fit(X)
>>>
>>> # Query for candidates
>>> candidates = model.query(Q)
>>> indices = candidates.get_indices()
>>> counts = candidates.get_counts()
>>> offsets = candidates.get_offsets()
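This page does not describe how the three arrays relate to each other. One plausible reading, which is an assumption and not confirmed here, is a CSR-style layout: `indices` holds every query's candidates back to back, `counts[i]` is the number of candidates for query `i`, and `offsets[i]` is where that query's slice starts. Under that assumption, per-query candidates could be recovered as sketched below, using stand-in NumPy arrays in place of real query results.

```python
import numpy as np

# Stand-in arrays; the CSR-style layout is an assumption, not documented behavior.
indices = np.array([4, 7, 9, 2, 5, 0, 1, 3, 8], dtype=np.int64)  # all candidates, concatenated
counts = np.array([3, 0, 2, 4], dtype=np.int64)                  # candidates per query
offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))          # start of each query's slice

def candidates_for(i):
    """Return the candidate indices of query i under the assumed layout."""
    start = offsets[i]
    return indices[start:start + counts[i]]

per_query = [candidates_for(i) for i in range(len(counts))]
for i, cand in enumerate(per_query):
    print(i, cand.tolist())
```

Checking `counts.sum() == len(indices)` on real results is a quick way to test whether this reading holds.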

n_hash_tables `property` ¶

n_hash_tables: int

Number of hash tables.

n_hashes `property` ¶

n_hashes: int

Number of hash functions per hash table.

seed `property` ¶

seed: int

Random seed.

fit ¶

fit(X: Union[numpy.ndarray, cupy.ndarray]) -> RPLSHModel

Fit the RPLSH model on input data.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Input vectors. Can be numpy or cupy array.	required

Returns:

Type	Description
`RPLSHModel`	The fitted model containing the LSH index.

fit_query ¶

fit_query(X: Union[numpy.ndarray, cupy.ndarray], batch_size: Optional[int] = None) -> Candidates

Simultaneously fit and query the LSH index. This is more efficient than calling fit(X) followed by query(X) because it avoids a search step to find matching buckets. Note: input vectors are considered candidate neighbors of themselves.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Input vectors to fit and query. Can be numpy or cupy array.	required
`batch_size`	`int`	If specified, process queries in batches of this size to reduce peak memory usage. When set, the call falls back to fit(X) followed by a batched query(X).	`None`

Returns:

Type	Description
`Candidates`	Query results containing candidate indices for each sample.
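Because every input vector is reported as a candidate neighbor of itself, callers who want only distinct neighbors may need to filter those entries out. The sketch below does so with NumPy, assuming the candidates are available as a flat `indices` array plus a per-row `counts` array (a CSR-style layout that this page does not confirm); `drop_self_matches` is a hypothetical helper, not part of culsh.

```python
import numpy as np

def drop_self_matches(indices, counts):
    """Remove candidate i from row i of a flat (indices, counts) candidate list."""
    rows = np.repeat(np.arange(len(counts)), counts)  # row id of every flat entry
    keep = indices != rows                            # False where a row lists itself
    new_counts = np.bincount(rows[keep], minlength=len(counts))
    return indices[keep], new_counts

# Stand-in data: row 0 -> [0, 2], row 1 -> [1], row 2 -> [0, 2]
indices = np.array([0, 2, 1, 0, 2], dtype=np.int64)
counts = np.array([2, 1, 2], dtype=np.int64)

filtered, new_counts = drop_self_matches(indices, counts)
print(filtered.tolist(), new_counts.tolist())
```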

culsh.RPLSHModel ¶

RPLSHModel(n_hash_tables: int, n_hashes: int, n_features: int, core: RPLSHCore, index: RPLSHIndex)

Model produced by RPLSH.fit() containing the fitted LSH index.

Parameters:

Name	Type	Description	Default
`n_hash_tables`	`int`	Number of hash tables.	required
`n_hashes`	`int`	Number of hashes per hash table.	required
`n_features`	`int`	Number of features.	required
`core`	`RPLSHCore`	Core RPLSH object containing the fitted index.	required
`index`	`RPLSHIndex`	Fitted index.	required

index `property` ¶

index: RPLSHIndex

The fitted index.

n_features `property` ¶

n_features: int

Number of features.

n_hash_tables `property` ¶

n_hash_tables: int

Number of hash tables.

n_hashes `property` ¶

n_hashes: int

Number of hashes per hash table.

load `classmethod` ¶

load(path: str) -> RPLSHModel

Load the RPLSH model from an npz file.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to load the model from.	required

query ¶

query(Q: Union[numpy.ndarray, cupy.ndarray], batch_size: Optional[int] = None) -> Candidates

Find candidate neighbors for the query vectors Q.

Parameters:

Name	Type	Description	Default
`Q`	`array-like of shape (n_queries, n_features)`	Query vectors. Can be numpy or cupy array.	required
`batch_size`	`int`	If specified, process queries in batches of this size to reduce peak memory usage.	`None`

Returns:

Type	Description
`Candidates`	Query results containing candidate indices for each query.
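LSH returns candidates, not ranked neighbors, so a common follow-up step is to rerank each query's candidates by exact cosine similarity and keep the top k. The sketch below shows that step with NumPy alone; `rerank_by_cosine` is a hypothetical helper, and it assumes the candidate ids for one query have already been extracted as an integer array and that no vector has zero norm.

```python
import numpy as np

def rerank_by_cosine(q, X, candidate_ids, k):
    """Return up to k candidate ids ordered by exact cosine similarity to q."""
    if len(candidate_ids) == 0:
        return candidate_ids
    C = X[candidate_ids]                                   # gather candidate vectors
    sims = (C @ q) / (np.linalg.norm(C, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]                          # highest similarity first
    return candidate_ids[order]

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]], dtype=np.float32)
q = np.array([1.0, 0.1], dtype=np.float32)
candidate_ids = np.array([3, 2, 0])   # stand-in for one query's candidates

top2 = rerank_by_cosine(q, X, candidate_ids, k=2)
print(top2.tolist())
```

Reranking touches only the candidate vectors, so its cost scales with the number of candidates per query, not with the size of the indexed dataset.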

save ¶

save(path: str) -> None

Save the RPLSH model to an npz file.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to save the model.	required
