daft.functions.minhash
minhash
minhash(text: Expression, *, num_hashes: int, ngram_size: int, seed: int = 1, hash_function: Literal['murmurhash3', 'xxhash', 'xxhash32', 'xxhash64', 'xxhash3_64', 'sha1'] = 'murmurhash3') -> Expression
Runs the MinHash algorithm on the series.
For each string, calculates the minimum hash over all of its ngrams, repeating with num_hashes permutations. The result is returned as a list of num_hashes 32-bit unsigned integers.
Tokens for the ngrams are delimited by spaces. The strings are not normalized or pre-processed, so it is recommended to normalize them yourself before hashing.
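To make the description above concrete, here is a minimal pure-Python sketch of the general MinHash idea: split on spaces, build word ngrams, hash each one, and keep the minimum under each of num_hashes permutations. It is not Daft's implementation. The helper names (`normalize`, `word_ngrams`, `base_hash`, `minhash_sketch`), the truncated SHA-1 base hash, the linear permutation scheme, and the handling of strings shorter than ngram_size are all assumptions made for illustration, so the values will not match what Daft returns.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the illustrative permutations
MAX_U32 = (1 << 32) - 1         # results are kept in the unsigned 32-bit range


def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def word_ngrams(text: str, n: int) -> list[str]:
    # Tokens are delimited by spaces; empty tokens are dropped (assumption).
    tokens = [t for t in text.split(" ") if t]
    if len(tokens) < n:
        # Assumption: a short string is treated as a single shingle.
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def base_hash(shingle: str) -> int:
    # Illustrative base hash: first 4 bytes of SHA-1 as an unsigned integer.
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash_sketch(text: str, num_hashes: int, ngram_size: int, seed: int = 1) -> list[int]:
    # Derive one (a, b) pair per permutation from the seed.
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]
    hashes = [base_hash(s) for s in word_ngrams(text, ngram_size)]
    # For each permutation, keep the minimum permuted hash over all ngrams.
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_U32 for h in hashes)
        for a, b in params
    ]


sig_a = minhash_sketch(normalize("The  quick brown Fox"), num_hashes=8, ngram_size=2)
sig_b = minhash_sketch(normalize("the quick brown fox"), num_hashes=8, ngram_size=2)
```

Because both inputs normalize to the same string, `sig_a` and `sig_b` are identical. Without the normalization step the extra space and capital letters would produce different ngrams and therefore different signatures, which is why normalizing beforehand is recommended.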
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | String Expression | The expression to hash. | required |
| num_hashes | int | The number of hash permutations to compute. | required |
| ngram_size | int | The number of tokens in each shingle/ngram. | required |
| seed | int, default=1 | Seed used for generating permutations and the initial string hashes. Defaults to 1. | 1 |
| hash_function | str, default="murmurhash3" | Hash function to use for initial string hashing. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". Defaults to "murmurhash3". | 'murmurhash3' |
Returns:
| Name | Type | Description |
|---|---|---|
| Expression | FixedSizeList[UInt32, num_hashes] Expression | An expression representing the MinHash values. |
Source code in daft/functions/misc.py