daft.functions.minhash

minhash

minhash(text: Expression, *, num_hashes: int, ngram_size: int, seed: int = 1, hash_function: Literal['murmurhash3', 'xxhash', 'xxhash32', 'xxhash64', 'xxhash3_64', 'sha1'] = 'murmurhash3') -> Expression

Computes a MinHash signature for each string in the expression.

For each string, the function calculates the minimum hash over all of its n-grams and repeats this with num_hashes permutations. The result is returned as a fixed-size list of num_hashes 32-bit unsigned integers.
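To make the idea concrete, here is a minimal pure-Python sketch of MinHash over word n-grams. It is not Daft's implementation and will not reproduce Daft's output values: the helper names (`word_ngrams`, `base_hash`, `minhash_sketch`), the truncated SHA-1 base hash, the affine "permutations" modulo a Mersenne prime, and the handling of strings shorter than one n-gram are all assumptions chosen for illustration.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the affine "permutations" (assumption)
MAX_HASH = (1 << 32) - 1        # mask that keeps values in the unsigned 32-bit range


def word_ngrams(text: str, ngram_size: int) -> list[str]:
    """Split on spaces and join each run of `ngram_size` consecutive tokens."""
    tokens = [t for t in text.split(" ") if t]
    if len(tokens) < ngram_size:
        # Assumption for this sketch: a short string forms a single shingle.
        return [" ".join(tokens)]
    return [
        " ".join(tokens[i : i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    ]


def base_hash(shingle: str) -> int:
    """32-bit hash of one shingle (first 4 bytes of SHA-1, used here for portability)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash_sketch(text: str, *, num_hashes: int, ngram_size: int, seed: int = 1) -> list[int]:
    """Return `num_hashes` minimum values, one per seeded affine permutation."""
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]
    hashes = [base_hash(s) for s in word_ngrams(text, ngram_size)]
    # For each permutation, keep the smallest permuted hash over all shingles.
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in params
    ]


signature = minhash_sketch("the quick brown fox", num_hashes=4, ngram_size=2)
print(signature)  # four integers, each in the unsigned 32-bit range
```

The same seed always yields the same permutation parameters, which is why signatures computed with one seed are comparable with each other but not with signatures computed under a different seed.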

Tokens for the n-grams are delimited by spaces. The input strings are not normalized or otherwise pre-processed, so normalize them yourself before hashing.
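Because tokens are split on spaces and nothing is normalized, strings that differ only in case, punctuation, or spacing produce different n-grams. The following is one possible normalization step, sketched in plain Python. The function name `normalize_text` and the specific choices (NFKC, lowercasing, dropping punctuation, collapsing whitespace) are assumptions; the right normalization depends on your data.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical pre-processing step applied before MinHash."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = text.lower()                         # make comparison case-insensitive
    text = re.sub(r"[^\w\s]", "", text)         # drop punctuation characters
    return " ".join(text.split())               # collapse runs of whitespace to one space


print(normalize_text("  Hello,   WORLD! "))  # prints "hello world"
```

After this step, tokens are separated by exactly one space, which matches the space-delimited tokenization described above.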

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | String Expression | The expression to hash. | required |
| num_hashes | int | The number of hash permutations to compute. | required |
| ngram_size | int | The number of tokens in each shingle/ngram. | required |
| seed | int | Seed used for generating permutations and the initial string hashes. | 1 |
| hash_function | str | Hash function to use for initial string hashing. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". | 'murmurhash3' |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Expression | FixedSizedList[UInt32, num_hashes] Expression | Expression representing the MinHash values. |
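A common use of MinHash signatures is estimating the Jaccard similarity of two documents as the fraction of positions where their signatures agree. The helper below is a plain-Python sketch of that comparison. The name `estimate_jaccard` is hypothetical and not part of Daft, and both signatures are assumed to come from the same num_hashes, ngram_size, seed, and hash_function.

```python
from collections.abc import Sequence


def estimate_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Estimate Jaccard similarity as the share of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


print(estimate_jaccard([1, 2, 3, 4], [1, 2, 0, 4]))  # prints 0.75
```

The estimate becomes less noisy as num_hashes grows, at the cost of larger signatures and more hashing work.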

Source code in daft/functions/misc.py
def minhash(
    text: Expression,
    *,
    num_hashes: int,
    ngram_size: int,
    seed: int = 1,
    hash_function: Literal["murmurhash3", "xxhash", "xxhash32", "xxhash64", "xxhash3_64", "sha1"] = "murmurhash3",
) -> Expression:
    """Runs the MinHash algorithm on the series.

    For a string, calculates the minimum hash over all its ngrams,
    repeating with `num_hashes` permutations. Returns as a list of 32-bit unsigned integers.

    Tokens for the ngrams are delimited by spaces.
    The strings are not normalized or pre-processed, so it is recommended
    to normalize the strings yourself.

    Args:
        text (String Expression): The expression to hash.
        num_hashes (int): The number of hash permutations to compute.
        ngram_size (int): The number of tokens in each shingle/ngram.
        seed (int, default=1): Seed used for generating permutations and the initial string hashes. Defaults to 1.
        hash_function (str, default="murmurhash3"): Hash function to use for initial string hashing. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". Defaults to "murmurhash3".

    Returns:
        Expression (FixedSizedList[UInt32, num_hashes] Expression):
            expression representing the MinHash values.

    """
    return Expression._call_builtin_scalar_fn(
        "minhash", text, num_hashes=num_hashes, ngram_size=ngram_size, seed=seed, hash_function=hash_function
    )