RPLSH

Random Projection LSH for approximate nearest neighbor search under cosine similarity.
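
As a rough illustration of the idea, the sketch below hashes a vector by recording on which side of several random hyperplanes it falls. It uses only the standard library; `make_hyperplanes` and `signature` are hypothetical helper names for this example and are not part of culsh, whose GPU implementation may differ.

```python
import random

def make_hyperplanes(n_hashes, n_features, seed=None):
    """Draw one Gaussian normal vector per hash bit (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_hashes)]

def signature(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * w for v, w in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits

planes = make_hyperplanes(n_hashes=8, n_features=4, seed=0)
x = [1.0, 0.5, -0.2, 0.3]
scaled = [2.0 * v for v in x]

# Only the direction of a vector matters, so scaling leaves the signature unchanged.
print(signature(x, planes) == signature(scaled, planes))  # True
```

Because each bit depends only on the sign of a dot product, vectors pointing in similar directions tend to share most bits, which is what ties the scheme to cosine similarity.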

Classes

culsh.RPLSH

RPLSH(n_hash_tables: int, n_hashes: int, seed: Optional[int] = None)

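Locality-sensitive hashing using random projections. Vectors separated by a small angle are likely to hash to the same bucket, which makes the scheme suitable for approximate nearest neighbor (ANN) search under cosine distance.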

Parameters:

Name Type Description Default
n_hash_tables int

Number of hash tables (OR-amplification of the locality-sensitive family). Each additional table gives another independent chance to find a neighbor, improving recall at the cost of more false positives. Corresponds to 'b' in the amplified probability 1-(1-s^r)^b, where s is the probability that two vectors agree on a single hash bit (for random projections, s = 1 - θ/π, with θ the angle between the two vectors).

required
n_hashes int

Number of hashes (random projections) per table (AND-amplification of the locality-sensitive family). More hashes require samples to agree on more hash bits, increasing precision at the cost of more false negatives. Corresponds to 'r' in the amplified probability 1-(1-s^r)^b, where s is the probability that two vectors agree on a single hash bit (for random projections, s = 1 - θ/π, with θ the angle between the two vectors).

required
seed int

Random seed for reproducible hashes. If None (default), a random seed is used.

None
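
To get a feel for how n_hashes (r) and n_hash_tables (b) trade off, the following sketch evaluates the amplified probability 1-(1-s^r)^b for a few angles. It takes s to be the probability that two vectors agree on a single hash bit, which for sign random projections is 1 - θ/π. The function names are hypothetical and exist only in this example.

```python
import math

def bit_agreement_probability(theta: float) -> float:
    """Probability that one random hyperplane puts both vectors on the same side."""
    return 1.0 - theta / math.pi

def candidate_probability(s: float, n_hashes: int, n_hash_tables: int) -> float:
    """Probability of sharing a bucket in at least one table: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** n_hashes) ** n_hash_tables

# Same settings as the example configuration: r = 8 hashes, b = 16 tables.
for degrees in (15, 45, 90):
    s = bit_agreement_probability(math.radians(degrees))
    p = candidate_probability(s, n_hashes=8, n_hash_tables=16)
    print(f"angle={degrees:3d} deg  s={s:.3f}  P(candidate)={p:.3f}")
```

Raising n_hashes lowers the probability for every pair (fewer false positives, more false negatives), while raising n_hash_tables raises it (better recall, more candidates to check).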

Examples:

>>> import numpy as np
>>> from culsh import RPLSH
>>>
>>> # Create random data
>>> X = np.random.randn(10000, 128).astype(np.float32)
>>> Q = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit model
>>> lsh = RPLSH(n_hash_tables=16, n_hashes=8)
>>> model = lsh.fit(X)
>>>
>>> # Query for candidates
>>> candidates = model.query(Q)
>>> indices = candidates.get_indices()
>>> counts = candidates.get_counts()
>>> offsets = candidates.get_offsets()
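
This page does not describe the layout of the three arrays returned above. The sketch below shows one plausible reading, a CSR-style layout in which `indices` is a flat array and `offsets` and `counts` locate each query's slice within it. This layout is an assumption, not confirmed by the documentation, and the values are made-up stand-ins rather than real culsh output.

```python
# Hypothetical stand-ins for get_indices(), get_offsets() and get_counts().
indices = [4, 7, 9, 2, 7]  # assumed: candidate row ids for all queries, concatenated
offsets = [0, 3, 3]        # assumed: where each query's slice starts in `indices`
counts = [3, 0, 2]         # assumed: how many candidates each query has

def candidates_for(query_id):
    """Slice out the candidates of one query under the assumed CSR-style layout."""
    start = offsets[query_id]
    return indices[start:start + counts[query_id]]

for query_id in range(len(counts)):
    print(query_id, candidates_for(query_id))
```

If the actual layout differs, only `candidates_for` would need to change.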

n_hash_tables property

n_hash_tables: int

Number of hash tables.

n_hashes property

n_hashes: int

Number of hash functions per hash table.

seed property

seed: int

Random seed.

fit

fit(X: Union[numpy.ndarray, cupy.ndarray]) -> RPLSHModel

Fit the RPLSH model on input data.

Parameters:

Name Type Description Default
X array-like of shape (n_samples, n_features)

Input vectors. Can be numpy or cupy array.

required

Returns:

Type Description
RPLSHModel

The fitted model containing the LSH index.

fit_query

fit_query(X: Union[numpy.ndarray, cupy.ndarray], batch_size: Optional[int] = None) -> Candidates

Simultaneously fit and query the LSH index. This is more efficient than calling fit(X) followed by query(X) because it avoids a search step to find matching buckets. Note: input vectors are considered candidate neighbors of themselves.

Parameters:

Name Type Description Default
X array-like of shape (n_samples, n_features)

Input vectors to fit and query. Can be numpy or cupy array.

required
batch_size int

If specified, process queries in batches of this size to reduce peak memory usage. Note that setting batch_size falls back to calling fit(X) followed by a batched query(X), so the search step that fit_query otherwise avoids is performed.

None

Returns:

Type Description
Candidates

Query results containing candidate indices for each sample.
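
The self-candidate behaviour can be pictured with a single hash table: once rows are grouped by hash code, each row's candidates are simply the other members of its bucket, plus the row itself. The sketch below is a conceptual stand-in using plain Python. `fit_query_one_table` and the signature values are hypothetical, and it does not reflect how culsh implements this step.

```python
from collections import defaultdict

def fit_query_one_table(signatures):
    """Group row ids by hash code and return each row's bucket as its candidates."""
    buckets = defaultdict(list)
    for row_id, sig in enumerate(signatures):
        buckets[sig].append(row_id)
    # Every row's own signature is a key of `buckets`, so its bucket includes itself.
    return [buckets[sig] for sig in signatures]

signatures = [0b1011, 0b0010, 0b1011, 0b0111]  # hypothetical per-row hash codes
candidates = fit_query_one_table(signatures)
print(candidates)  # [[0, 2], [1], [0, 2], [3]]

# Dropping self-matches afterwards is a simple filter.
without_self = [[c for c in cands if c != i] for i, cands in enumerate(candidates)]
print(without_self)  # [[2], [], [0], []]
```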

culsh.RPLSHModel

RPLSHModel(n_hash_tables: int, n_hashes: int, n_features: int, core: RPLSHCore, index: RPLSHIndex)

Model produced by RPLSH.fit() containing the fitted LSH index.

Parameters:

Name Type Description Default
n_hash_tables int

Number of hash tables.

required
n_hashes int

Number of hashes per hash table.

required
n_features int

Number of features.

required
core RPLSHCore

Core RPLSH object containing the fitted index.

required
index RPLSHIndex

Fitted index.

required

index property

index: RPLSHIndex

The fitted index.

n_features property

n_features: int

Number of features.

n_hash_tables property

n_hash_tables: int

Number of hash tables.

n_hashes property

n_hashes: int

Number of hashes per hash table.

load classmethod

load(path: str) -> RPLSHModel

Load the RPLSH model from an npz file.

Parameters:

Name Type Description Default
path str

Path to load the model from.

required

query

query(Q: Union[numpy.ndarray, cupy.ndarray], batch_size: Optional[int] = None) -> Candidates

Find candidate neighbors for the query vectors Q.

Parameters:

Name Type Description Default
Q array-like of shape (n_queries, n_features)

Query vectors. Can be numpy or cupy array.

required
batch_size int

If specified, process queries in batches of this size to reduce peak memory usage.

None

Returns:

Type Description
Candidates

Query results containing candidate indices for each query.
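
As a generic illustration of what batching means here, the sketch below splits a set of query rows into consecutive ranges of at most batch_size rows, so that only one range needs to be processed at a time. `iter_batches` is a hypothetical helper for this example; how culsh batches internally is not documented on this page.

```python
def iter_batches(n_queries, batch_size=None):
    """Yield (start, stop) row ranges; a single range when batch_size is None."""
    if batch_size is None:
        batch_size = max(n_queries, 1)
    for start in range(0, n_queries, batch_size):
        yield start, min(start + batch_size, n_queries)

print(list(iter_batches(10, batch_size=4)))  # [(0, 4), (4, 8), (8, 10)]
print(list(iter_batches(10)))                # [(0, 10)]
```

Smaller batches lower peak memory because fewer query rows and candidate lists are held at once, at the cost of more passes.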

save

save(path: str) -> None

Save the RPLSH model to an npz file.

Parameters:

Name Type Description Default
path str

Path to save the model.

required