daft.functions.simhash
simhash
simhash(text: Expression, *, ngram_size: int = 3, hash_function: Literal['murmurhash3', 'xxhash', 'xxhash32', 'xxhash64', 'xxhash3_64', 'sha1'] = 'xxhash3_64') -> Expression
Compute a SimHash fingerprint of the input text.
SimHash produces a 64-bit locality-sensitive hash from character n-grams. Similar texts produce fingerprints with small bitwise Hamming distance, making it useful for near-duplicate detection.
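To make the idea concrete, here is a minimal pure-Python sketch of the textbook SimHash construction over character n-grams, together with a Hamming-distance helper. It is an illustration only: the helper names (`char_ngrams`, `hash64`, `simhash_sketch`, `hamming_distance`), the use of truncated SHA-1 as the 64-bit n-gram hash, and the handling of inputs shorter than the n-gram size are all assumptions made for this example. Daft's implementation may differ in these details, so fingerprints from this sketch should not be expected to match the values `simhash` returns.

```python
import hashlib


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping character n-grams.

    Assumption for this sketch: inputs no longer than n yield the whole
    text as a single gram, and empty input yields no grams.
    """
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash64(gram: str) -> int:
    """Illustrative 64-bit hash: the first 8 bytes of SHA-1 (an assumption)."""
    digest = hashlib.sha1(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash_sketch(text: str, n: int = 3) -> int:
    """Textbook SimHash: a per-bit majority vote over the n-gram hashes."""
    counts = [0] * 64
    for gram in char_ngrams(text, n):
        h = hash64(gram)
        for bit in range(64):
            # Each gram votes +1 for a set bit and -1 for a clear bit.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    base = simhash_sketch("the quick brown fox jumps over the lazy dog")
    near = simhash_sketch("the quick brown fox jumped over the lazy dog")
    far = simhash_sketch("completely unrelated sentence about databases")
    print("near-duplicate distance:", hamming_distance(base, near))
    print("unrelated distance:", hamming_distance(base, far))
```

For near-duplicate detection, two fingerprints are typically treated as matching when their Hamming distance falls below a small threshold chosen for the dataset.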
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | String Expression | The expression to hash. | required |
| ngram_size | int | Character n-gram size. | 3 |
| hash_function | str | Hash function for n-grams. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". | 'xxhash3_64' |
Returns:
| Name | Type | Description |
|---|---|---|
| Expression | UInt64 Expression | SimHash fingerprint. |
Source code in daft/functions/misc.py