daft.functions.tokenize_encode

tokenize_encode

tokenize_encode(expr: Expression, tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None, use_special_tokens: bool | None = None) -> Expression

Encodes each string as a list of integer tokens using a tokenizer.

Uses https://github.com/openai/tiktoken for tokenization.

Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Tokens can also be loaded from a file in tiktoken format.
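As a minimal usage sketch, the helper below encodes a small column of strings with one of the built-in tokenizers listed above. The helper name `encode_texts`, the column names `text` and `tokens`, and the sample strings are illustrative assumptions, not part of the Daft API; Daft is imported inside the function so the helper can be defined even where Daft is not installed.

```python
def encode_texts(texts, tokenizer="cl100k_base"):
    """Encode a list of strings and return one list of token IDs per string.

    `encode_texts` is a hypothetical helper for illustration only.
    """
    # Lazy imports: Daft is only required when the helper is actually called.
    import daft
    from daft.functions import tokenize_encode

    # Build a one-column DataFrame from the input strings.
    df = daft.from_pydict({"text": texts})

    # Apply the tokenizer to the "text" column; `tokenizer` is passed as `tokens_path`.
    encoded = df.select(tokenize_encode(df["text"], tokenizer).alias("tokens"))

    # Materialize the result as plain Python lists.
    return encoded.to_pydict()["tokens"]
```

Calling `encode_texts(["hello world", "Daft tokenizes strings"])` should return two lists of integers, one per input string. Passing a different built-in name, such as `"o200k_base"`, as the second argument selects another vocabulary.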

Parameters:

Name	Type	Description	Default
`expr`	`Expression`	The expression to encode.	required
`tokens_path`	`str`	The name of a built-in tokenizer, or the path to a token file (remote paths are supported and will be downloaded).	required
`io_config`	`IOConfig | None`	IOConfig to use when accessing remote storage.	`None`
`pattern`	`str | None`	Regex pattern used to split strings during the tokenization step. Required when loading from a file.	`None`
`special_tokens`	`str | None`	Name of the set of special tokens to use. Currently only "llama3" is supported. Required when loading from a file.	`None`
`use_special_tokens`	`bool | None`	Whether to parse special tokens that appear in the input. Disabled by default; automatically enabled if `special_tokens` is provided.	`None`
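For readers preparing a token file to pass as `tokens_path`, the sketch below illustrates the tiktoken file layout as read by tiktoken's own loader: one token per line, written as the base64-encoded token bytes, a space, and the integer rank. The helper names, the file name `toy.tiktoken`, and the three-entry vocabulary are assumptions made for illustration. The toy file only demonstrates the layout and is not a usable vocabulary.

```python
import base64
import os
import tempfile


def write_tiktoken_file(path, ranks):
    """Write a {token_bytes: rank} mapping as 'base64(token) rank' lines."""
    with open(path, "wb") as f:
        # Emit entries in rank order, as published vocabulary files do.
        for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
            f.write(base64.b64encode(token) + b" " + str(rank).encode("ascii") + b"\n")


def read_tiktoken_file(path):
    """Parse a tiktoken-format file back into a {token_bytes: rank} mapping."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Hypothetical toy vocabulary, for showing the file layout only.
toy_ranks = {b"a": 0, b"b": 1, b"ab": 2}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.tiktoken")
    write_tiktoken_file(toy_path, toy_ranks)
    loaded = read_tiktoken_file(toy_path)
```

A round trip through the two helpers should reproduce the original mapping. When loading a real file with `tokenize_encode`, the parameter table above indicates that `pattern` and `special_tokens` must also be supplied.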

Returns:

Name	Type	Description
`Expression`	`Expression`	An expression with the encodings of the strings as lists of unsigned 32-bit integers.

Note

When using this expression with Llama 3 tokens, be aware that Llama 3 applies extra preprocessing to strings in certain edge cases, so the encodings produced here may differ slightly from Llama 3's own in those cases.

Source code in daft/functions/str.py

def tokenize_encode(
    expr: Expression,
    tokens_path: str,
    *,
    io_config: IOConfig | None = None,
    pattern: str | None = None,
    special_tokens: str | None = None,
    use_special_tokens: bool | None = None,
) -> Expression:
    """Encodes each string as a list of integer tokens using a tokenizer.

    Uses https://github.com/openai/tiktoken for tokenization.

    Supported built-in tokenizers: `cl100k_base`, `o200k_base`, `p50k_base`, `p50k_edit`, `r50k_base`. Also supports
    loading tokens from a file in tiktoken format.

    Args:
        expr: The expression to encode.
        tokens_path: The name of a built-in tokenizer, or the path to a token file (supports downloading).
        io_config (optional): IOConfig to use when accessing remote storage.
        pattern (optional): Regex pattern to use to split strings in tokenization step. Necessary if loading from a file.
        special_tokens (optional): Name of the set of special tokens to use. Currently only "llama3" supported. Necessary if loading from a file.
        use_special_tokens (optional): Whether or not to parse special tokens included in input. Disabled by default. Automatically enabled if `special_tokens` is provided.

    Returns:
        Expression: An expression with the encodings of the strings as lists of unsigned 32-bit integers.

    Note:
        If using this expression with Llama 3 tokens, note that Llama 3 does some extra preprocessing on
        strings in certain edge cases. This may result in slightly different encodings in these cases.

    """
    return Expression._call_builtin_scalar_fn(
        "tokenize_encode",
        expr,
        tokens_path=tokens_path,
        io_config=io_config,
        pattern=pattern,
        special_tokens=special_tokens,
        use_special_tokens=use_special_tokens,
    )