daft.functions.normalize

normalize

normalize(expr: Expression, *, remove_punct: bool = False, lowercase: bool = False, nfd_unicode: bool = False, white_space: bool = False) -> Expression

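Normalizes a string so that values differing only in punctuation, case, Unicode form, or whitespace can compare equal, which makes deduplication more useful.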

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| expr | Expression | The expression to normalize. | required |
| remove_punct | bool | Whether to remove all ASCII punctuation. | False |
| lowercase | bool | Whether to convert the string to lowercase. | False |
| nfd_unicode | bool | Whether to normalize and decompose Unicode characters according to NFD. | False |
| white_space | bool | Whether to normalize whitespace, replacing newlines and similar characters with spaces and collapsing repeated spaces. | False |
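
To illustrate what NFD decomposition means for the `nfd_unicode` option, here is a small sketch using Python's standard `unicodedata` module rather than Daft itself. It only shows the Unicode concept; whether Daft produces byte-identical output to `unicodedata.normalize` is an assumption, not something stated in this page.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' stored as one code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

# The two strings render identically but compare unequal as-is.
same_before = composed == decomposed

# NFD rewrites both into the decomposed form, so they compare equal afterwards.
nfd_composed = unicodedata.normalize("NFD", composed)
nfd_decomposed = unicodedata.normalize("NFD", decomposed)
same_after = nfd_composed == nfd_decomposed

print(same_before, same_after)           # False True
print(len(composed), len(nfd_composed))  # 4 5
```

This is why Unicode normalization can matter for deduplication: visually identical strings may be stored with different code point sequences.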

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Expression | Expression | A String expression which is normalized. |

Note

All processing options are off by default.

Examples:

>>> import daft
>>> from daft.functions import normalize
>>> df = daft.from_pydict({"x": ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]})
>>> df = df.with_column("normalized", normalize(df["x"], remove_punct=True, lowercase=True, white_space=True))
>>> df.show()
╭───────────────┬─────────────╮
│ x             ┆ normalized  │
│ ---           ┆ ---         │
│ String        ┆ String      │
╞═══════════════╪═════════════╡
│ hello world   ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Hello, world! ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ HELLO,        ┆ hello world │
│ WORLD!!!!     ┆             │
╰───────────────┴─────────────╯
(Showing first 3 of 3 rows)
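
For readers who want a feel for what each flag does without running Daft, the following is a rough pure-Python approximation of the options shown above. The helper `normalize_text` is hypothetical and is not Daft's implementation; the order in which the steps are applied and the trimming of leading and trailing whitespace are assumptions made for this sketch.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_ASCII_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_text(s, *, remove_punct=False, lowercase=False,
                   nfd_unicode=False, white_space=False):
    """Hypothetical stand-in for illustration; not Daft's implementation."""
    if remove_punct:
        s = s.translate(_ASCII_PUNCT)
    if lowercase:
        s = s.lower()
    if white_space:
        # Assumption: runs of whitespace (including newlines) collapse to a
        # single space, and leading/trailing whitespace is dropped.
        s = " ".join(s.split())
    if nfd_unicode:
        s = unicodedata.normalize("NFD", s)
    return s


rows = ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]

# Same flags as the Daft example above.
normalized = [
    normalize_text(r, remove_punct=True, lowercase=True, white_space=True)
    for r in rows
]
unique = sorted(set(normalized))

# With every option left at its default, the input passes through unchanged.
unchanged = [normalize_text(r) for r in rows]

print(normalized)  # ['hello world', 'hello world', 'hello world']
print(unique)      # ['hello world']
```

Three distinct raw strings collapse to a single normalized value, which is the deduplication use case the function targets.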
Source code in daft/functions/str.py
def normalize(
    expr: Expression,
    *,
    remove_punct: bool = False,
    lowercase: bool = False,
    nfd_unicode: bool = False,
    white_space: bool = False,
) -> Expression:
    r"""Normalizes a string for more useful deduplication.

    Args:
        expr: The expression to normalize.
        remove_punct: Whether to remove all punctuation (ASCII).
        lowercase: Whether to convert the string to lowercase.
        nfd_unicode: Whether to normalize and decompose Unicode characters according to NFD.
        white_space: Whether to normalize whitespace, replacing newlines etc with spaces and removing double spaces.

    Returns:
        Expression: a String expression which is normalized.

    Note:
        All processing options are off by default.

    Examples:
        >>> import daft
        >>> from daft.functions import normalize
        >>> df = daft.from_pydict({"x": ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]})
        >>> df = df.with_column("normalized", normalize(df["x"], remove_punct=True, lowercase=True, white_space=True))
        >>> df.show()
        ╭───────────────┬─────────────╮
        │ x             ┆ normalized  │
        │ ---           ┆ ---         │
        │ String        ┆ String      │
        ╞═══════════════╪═════════════╡
        │ hello world   ┆ hello world │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ Hello, world! ┆ hello world │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ HELLO,        ┆ hello world │
        │ WORLD!!!!     ┆             │
        ╰───────────────┴─────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    """
    return Expression._call_builtin_scalar_fn(
        "normalize",
        expr,
        remove_punct=remove_punct,
        lowercase=lowercase,
        nfd_unicode=nfd_unicode,
        white_space=white_space,
    )