daft.functions.normalize
normalize
normalize(expr: Expression, *, remove_punct: bool = False, lowercase: bool = False, nfd_unicode: bool = False, white_space: bool = False) -> Expression
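Normalizes a string so that values differing only in case, punctuation, whitespace, or Unicode form compare equal, which makes deduplication more effective.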
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expr | Expression | The expression to normalize. | required |
| remove_punct | bool | Whether to remove all punctuation (ASCII). | False |
| lowercase | bool | Whether to convert the string to lowercase. | False |
| nfd_unicode | bool | Whether to normalize and decompose Unicode characters according to NFD. | False |
| white_space | bool | Whether to normalize whitespace, replacing newlines etc. with spaces and removing double spaces. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| Expression | Expression | A normalized String expression. |
Note
All processing options are off by default.
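To make the four flags concrete, here is a rough pure-Python approximation of what each one does to a single string. It is an illustrative sketch, not Daft's implementation: the helper name `normalize_text`, the order in which the steps run, and the choice not to trim leading or trailing whitespace are all assumptions.

```python
import re
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(
    text: str,
    *,
    remove_punct: bool = False,
    lowercase: bool = False,
    nfd_unicode: bool = False,
    white_space: bool = False,
) -> str:
    """Hypothetical stand-in for the four flags; step order is assumed."""
    if remove_punct:
        text = text.translate(_PUNCT_TABLE)
    if lowercase:
        text = text.lower()
    if nfd_unicode:
        # NFD splits precomposed characters into a base character plus combining marks.
        text = unicodedata.normalize("NFD", text)
    if white_space:
        # Collapse any run of whitespace (newlines, tabs, repeated spaces) into one space.
        text = re.sub(r"\s+", " ", text)
    return text


# With every flag left at its default, the input comes back unchanged.
unchanged = normalize_text("Hello, world!")
cleaned = normalize_text(
    "HELLO,   \nWORLD!!!!", remove_punct=True, lowercase=True, white_space=True
)
decomposed = normalize_text("\u00e9", nfd_unicode=True)
```

Because each flag is independent, a caller can enable only the steps that matter for a given dataset, for example lowercasing without touching punctuation.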
Examples:
╭───────────────┬─────────────╮
│ x ┆ normalized │
│ --- ┆ --- │
│ String ┆ String │
╞═══════════════╪═════════════╡
│ hello world ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Hello, world! ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ HELLO, ┆ hello world │
│ WORLD!!!! ┆ │
╰───────────────┴─────────────╯
(Showing first 3 of 3 rows)
Source code in daft/functions/str.py