daft.functions.tokenize_encode
tokenize_encode
tokenize_encode(expr: Expression, tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None, use_special_tokens: bool | None = None) -> Expression
Encodes each string as a list of integer tokens using a tokenizer.
Uses https://github.com/openai/tiktoken for tokenization.
Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expr | Expression | The expression to encode. | required |
| tokens_path | str | The name of a built-in tokenizer, or the path to a token file (supports downloading). | required |
| io_config | optional | IOConfig to use when accessing remote storage. | None |
| pattern | optional | Regex pattern used to split strings in the tokenization step. Necessary if loading from a file. | None |
| special_tokens | optional | Name of the set of special tokens to use. Currently only "llama3" is supported. Necessary if loading from a file. | None |
| use_special_tokens | optional | Whether or not to parse special tokens included in input. Disabled by default. Automatically enabled if | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Expression | Expression | An expression with the encodings of the strings as lists of unsigned 32-bit integers. |
Note
If using this expression with Llama 3 tokens, note that Llama 3 does some extra preprocessing on strings in certain edge cases. This may result in slightly different encodings in these cases.
Source code in daft/functions/str.py