daft.functions.tokenize_decode

tokenize_decode

tokenize_decode(expr: Expression, tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None) -> Expression

Decodes each list of integer tokens into a string using a tokenizer.

Uses https://github.com/openai/tiktoken for tokenization.

Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.
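To make the "tiktoken format" concrete, here is a minimal pure-Python sketch of what decoding from such a file involves. It assumes the usual tiktoken layout of one `<base64-encoded token bytes> <integer rank>` pair per line. The helper names (`load_tiktoken_vocab`, `decode`) and the three-entry toy vocabulary are hypothetical and used only for illustration; this is not Daft's implementation.

```python
import base64


def load_tiktoken_vocab(text: str) -> dict[int, bytes]:
    """Parse tiktoken-format text: one '<base64 token bytes> <rank>' pair per line."""
    vocab: dict[int, bytes] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        b64_token, rank = line.split()
        vocab[int(rank)] = base64.b64decode(b64_token)
    return vocab


def decode(tokens: list[int], vocab: dict[int, bytes]) -> str:
    """Map each token ID to its bytes, join them, then decode as UTF-8."""
    # Join the bytes before decoding: one character may span several tokens.
    return b"".join(vocab[t] for t in tokens).decode("utf-8", errors="replace")


# Hypothetical toy vocabulary, written out in the same file format.
toy_file = "\n".join(
    f"{base64.b64encode(tok).decode('ascii')} {rank}"
    for rank, tok in enumerate([b"hello", b" ", b"world"])
)

vocab = load_tiktoken_vocab(toy_file)
decoded = decode([0, 1, 2], vocab)
print(decoded)
```

Decoding only needs the rank-to-bytes table, which is why the sketch never uses a split pattern.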

Parameters:

expr (Expression, required): The expression to decode.

tokens_path (str, required): The name of a built-in tokenizer, or the path to a token file (supports downloading).

io_config (IOConfig | None, default None): IOConfig to use when accessing remote storage.

pattern (str | None, default None): Regex pattern used to split strings in the tokenization step. Necessary if loading from a file.

special_tokens (str | None, default None): Name of the set of special tokens to use. Currently only "llama3" is supported. Necessary if loading from a file.

Returns:

Expression: An expression with decoded strings.

Source code in daft/functions/str.py
def tokenize_decode(
    expr: Expression,
    tokens_path: str,
    *,
    io_config: IOConfig | None = None,
    pattern: str | None = None,
    special_tokens: str | None = None,
) -> Expression:
    """Decodes each list of integer tokens into a string using a tokenizer.

    Uses [https://github.com/openai/tiktoken](https://github.com/openai/tiktoken) for tokenization.

    Supported built-in tokenizers: `cl100k_base`, `o200k_base`, `p50k_base`, `p50k_edit`, `r50k_base`. Also supports
    loading tokens from a file in tiktoken format.

    Args:
        expr: The expression to decode.
        tokens_path: The name of a built-in tokenizer, or the path to a token file (supports downloading).
        io_config (optional): IOConfig to use when accessing remote storage.
        pattern (optional): Regex pattern used to split strings in the tokenization step. Necessary if loading from a file.
        special_tokens (optional): Name of the set of special tokens to use. Currently only "llama3" is supported. Necessary if loading from a file.

    Returns:
        Expression: An expression with decoded strings.
    """
    return Expression._call_builtin_scalar_fn(
        "tokenize_decode",
        expr,
        tokens_path=tokens_path,
        io_config=io_config,
        pattern=pattern,
        special_tokens=special_tokens,
    )