daft.functions.normalize

normalize

normalize(expr: Expression, *, remove_punct: bool = False, lowercase: bool = False, nfd_unicode: bool = False, white_space: bool = False) -> Expression

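Normalizes a string so that values differing only in punctuation, case, Unicode form, or whitespace can be made to compare equal, which makes deduplication more useful.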

Parameters:

Name	Type	Description	Default
`expr`	`Expression`	The expression to normalize.	required
`remove_punct`	`bool`	Whether to remove all ASCII punctuation.	`False`
`lowercase`	`bool`	Whether to convert the string to lowercase.	`False`
`nfd_unicode`	`bool`	Whether to normalize and decompose Unicode characters according to NFD.	`False`
`white_space`	`bool`	Whether to normalize whitespace, replacing newlines and other whitespace characters with spaces and collapsing repeated spaces into one.	`False`

Returns:

Name	Type	Description
`Expression`	`Expression`	A String expression containing the normalized text.

Note

All processing options are off by default.
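To make the four flags concrete, here is a minimal pure-Python sketch of what each option describes. It is an approximation built on the standard library, not Daft's implementation: the helper name `normalize_sketch`, the order in which the steps are applied, and the exact set of characters treated as punctuation or whitespace are all assumptions made for illustration.

```python
import re
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_sketch(
    text: str,
    *,
    remove_punct: bool = False,
    lowercase: bool = False,
    nfd_unicode: bool = False,
    white_space: bool = False,
) -> str:
    """Hypothetical stand-in for the documented flags (step order is assumed)."""
    if remove_punct:
        text = text.translate(_PUNCT_TABLE)
    if lowercase:
        text = text.lower()
    if nfd_unicode:
        text = unicodedata.normalize("NFD", text)
    if white_space:
        # Replace each run of whitespace (spaces, tabs, newlines) with one space.
        text = re.sub(r"\s+", " ", text)
    return text


rows = ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]
normalized = [
    normalize_sketch(r, remove_punct=True, lowercase=True, white_space=True)
    for r in rows
]
print(normalized)  # ['hello world', 'hello world', 'hello world']

# With every flag left at its default, the input comes back unchanged.
unchanged = normalize_sketch("Hello, world!")
```

Because every option defaults to `False`, calling the sketch without flags returns the input as-is, mirroring the note above.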

Examples:

>>> import daft
>>> from daft.functions import normalize
>>> df = daft.from_pydict({"x": ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]})
>>> df = df.with_column("normalized", normalize(df["x"], remove_punct=True, lowercase=True, white_space=True))
>>> df.show()

╭───────────────┬─────────────╮
│ x             ┆ normalized  │
│ ---           ┆ ---         │
│ String        ┆ String      │
╞═══════════════╪═════════════╡
│ hello world   ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Hello, world! ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ HELLO,        ┆ hello world │
│ WORLD!!!!     ┆             │
╰───────────────┴─────────────╯
(Showing first 3 of 3 rows)
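The example above does not exercise `nfd_unicode`. As a rough illustration of why NFD decomposition helps with deduplication, the following standard-library sketch compares two visually identical strings that are encoded differently. It uses Python's `unicodedata` module rather than Daft, so it shows the NFD concept only and makes no claim about how Daft applies it internally.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single precomposed code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

# The two strings render the same but are different code point sequences.
same_before = composed == decomposed

# NFD rewrites both into the same fully decomposed form.
nfd_composed = unicodedata.normalize("NFD", composed)
nfd_decomposed = unicodedata.normalize("NFD", decomposed)
same_after = nfd_composed == nfd_decomposed

print(same_before, same_after)            # False True
print(len(composed), len(nfd_composed))   # 4 5
```

Strings like these would count as distinct values in a deduplication step unless they are first brought into a common Unicode form.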

Source code in daft/functions/str.py

def normalize(
    expr: Expression,
    *,
    remove_punct: bool = False,
    lowercase: bool = False,
    nfd_unicode: bool = False,
    white_space: bool = False,
) -> Expression:
    r"""Normalizes a string for more useful deduplication.

    Args:
        expr: The expression to normalize.
        remove_punct: Whether to remove all punctuation (ASCII).
        lowercase: Whether to convert the string to lowercase.
        nfd_unicode: Whether to normalize and decompose Unicode characters according to NFD.
        white_space: Whether to normalize whitespace, replacing newlines etc with spaces and removing double spaces.

    Returns:
        Expression: a String expression which is normalized.

    Note:
        All processing options are off by default.

    Examples:
        >>> import daft
        >>> from daft.functions import normalize
        >>> df = daft.from_pydict({"x": ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]})
        >>> df = df.with_column("normalized", normalize(df["x"], remove_punct=True, lowercase=True, white_space=True))
        >>> df.show()
        ╭───────────────┬─────────────╮
        │ x             ┆ normalized  │
        │ ---           ┆ ---         │
        │ String        ┆ String      │
        ╞═══════════════╪═════════════╡
        │ hello world   ┆ hello world │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ Hello, world! ┆ hello world │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ HELLO,        ┆ hello world │
        │ WORLD!!!!     ┆             │
        ╰───────────────┴─────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    """
    return Expression._call_builtin_scalar_fn(
        "normalize",
        expr,
        remove_punct=remove_punct,
        lowercase=lowercase,
        nfd_unicode=nfd_unicode,
        white_space=white_space,
    )