Skip to content

daft.functions.tokenize_encode#

tokenize_encode #

tokenize_encode(expr: Expression, tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None, use_special_tokens: bool | None = None) -> Expression

Encodes each string as a list of integer tokens using a tokenizer.

Uses https://github.com/openai/tiktoken for tokenization.

Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.

Parameters:

Name Type Description Default
expr Expression

The expression to encode.

required
tokens_path str

The name of a built-in tokenizer, or the path to a token file (supports downloading).

required
io_config optional

IOConfig to use when accessing remote storage.

None
pattern optional

Regex pattern to use to split strings in tokenization step. Necessary if loading from a file.

None
special_tokens optional

Name of the set of special tokens to use. Currently only "llama3" supported. Necessary if loading from a file.

None
use_special_tokens optional

Whether or not to parse special tokens included in input. Disabled by default. Automatically enabled if special_tokens is provided.

None

Returns:

Name Type Description
Expression Expression

An expression with the encodings of the strings as lists of unsigned 32-bit integers.

Note

If using this expression with Llama 3 tokens, note that Llama 3 does some extra preprocessing on strings in certain edge cases. This may result in slightly different encodings in these cases.

Source code in daft/functions/str.py
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
def tokenize_encode(
    expr: Expression,
    tokens_path: str,
    *,
    io_config: IOConfig | None = None,
    pattern: str | None = None,
    special_tokens: str | None = None,
    use_special_tokens: bool | None = None,
) -> Expression:
    """Encodes each string as a list of integer tokens using a tokenizer.

    Uses https://github.com/openai/tiktoken for tokenization.

    Supported built-in tokenizers: `cl100k_base`, `o200k_base`, `p50k_base`, `p50k_edit`, `r50k_base`. Also supports
    loading tokens from a file in tiktoken format.

    Args:
        expr: The expression to encode.
        tokens_path: The name of a built-in tokenizer, or the path to a token file (supports downloading).
        io_config (optional): IOConfig to use when accessing remote storage.
        pattern (optional): Regex pattern to use to split strings in tokenization step. Necessary if loading from a file.
        special_tokens (optional): Name of the set of special tokens to use. Currently only "llama3" supported. Necessary if loading from a file.
        use_special_tokens (optional): Whether or not to parse special tokens included in input. Disabled by default. Automatically enabled if `special_tokens` is provided.

    Returns:
        Expression: An expression with the encodings of the strings as lists of unsigned 32-bit integers.

    Note:
        If using this expression with Llama 3 tokens, note that Llama 3 does some extra preprocessing on
        strings in certain edge cases. This may result in slightly different encodings in these cases.

    """
    return Expression._call_builtin_scalar_fn(
        "tokenize_encode",
        expr,
        tokens_path=tokens_path,
        io_config=io_config,
        pattern=pattern,
        special_tokens=special_tokens,
        use_special_tokens=use_special_tokens,
    )