daft.functions.tokenize_decode
tokenize_decode
tokenize_decode(expr: Expression, tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None) -> Expression
Decodes each list of integer tokens into a string using a tokenizer.
Uses https://github.com/openai/tiktoken for tokenization.
Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.
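To make "decoding" and "tiktoken format" concrete, here is a minimal pure-Python sketch. It is not Daft's implementation. It assumes the tiktoken file layout of one `<base64 token bytes> <rank>` pair per line, and that decoding concatenates the bytes of each token and interprets the result as UTF-8. The four-entry vocabulary and the helper names `load_tiktoken_ranks` and `decode` are hypothetical, chosen only for illustration.

```python
import base64


def load_tiktoken_ranks(text: str) -> dict[int, bytes]:
    """Parse tiktoken-format text: one '<base64 token bytes> <rank>' pair per line."""
    id_to_bytes: dict[int, bytes] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        id_to_bytes[int(rank)] = base64.b64decode(token_b64)
    return id_to_bytes


def decode(tokens: list[int], id_to_bytes: dict[int, bytes]) -> str:
    """Concatenate the bytes of each token, then interpret them as UTF-8."""
    raw = b"".join(id_to_bytes[t] for t in tokens)
    return raw.decode("utf-8", errors="replace")


# Hypothetical toy vocabulary, serialized in the assumed file format.
TOY_VOCAB = "\n".join(
    f"{base64.b64encode(tok).decode('ascii')} {rank}"
    for rank, tok in enumerate([b"Hel", b"lo", b" wor", b"ld"])
)

id_to_bytes = load_tiktoken_ranks(TOY_VOCAB)
decoded = decode([0, 1, 2, 3], id_to_bytes)
print(decoded)
```

Real token files contain tens of thousands of entries, and a single token may hold only part of a multi-byte character, which is why the sketch joins the bytes before decoding rather than decoding token by token.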
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expr | Expression | The expression to decode. | required |
| tokens_path | str | The name of a built-in tokenizer, or the path to a token file (supports downloading). | required |
| io_config | optional | IOConfig to use when accessing remote storage. | None |
| pattern | optional | Regex pattern used to split strings in the tokenization step. Required when loading from a file. | None |
| special_tokens | optional | Name of the set of special tokens to use. Currently only "llama3" is supported. Required when loading from a file. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Expression | Expression | An expression with decoded strings. |
Source code in daft/functions/str.py