daft.functions.regexp_split

regexp_split(expr: Expression, pattern: str | Expression) -> Expression

Splits each string into a list of substrings at every match of the given regex pattern.

Parameters:

Name     Type              Default   Description
expr     Expression        required  The expression to split.
pattern  str | Expression  required  The pattern on which each string should be split, or a column to pick such patterns from.
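
When `pattern` is a column rather than a single string, each row is presumably split with the pattern from the same row. The sketch below models that pairing in plain Python with the standard `re` module. It does not call Daft; `split_rows` is a hypothetical helper introduced only for illustration, and Daft's regex engine may differ from Python's `re` in syntax details.

```python
import re


def split_rows(strings, patterns):
    """Hypothetical helper: split each string with the pattern from the same row.

    This is a pure-Python model of passing a column as `pattern`,
    not Daft's implementation.
    """
    return [re.split(p, s) for s, p in zip(strings, patterns)]


# Illustrative data: one delimiter pattern per row.
data = ["a,b,,c", "x--y-z", "1.2...3"]
patterns = [r",+", r"-+", r"\.+"]

rows = split_rows(data, patterns)
for original, parts in zip(data, rows):
    print(original, "->", parts)
```

Under this reading, each row yields its own list, so rows can use different delimiters without a separate call per delimiter.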

Returns:

Name        Type        Description
Expression  Expression  A List[String] expression containing the string splits for each string in the column.

Examples:

>>> import daft
>>> from daft.functions import regexp_split
>>>
>>> df = daft.from_pydict({"data": ["daft.distributed...query", "a.....b.c", "1.2...3.."]})
>>> df.with_column("split", regexp_split(df["data"], r"\.+")).collect()
╭──────────────────────────┬────────────────────────────╮
│ data                     ┆ split                      │
│ ---                      ┆ ---                        │
│ String                   ┆ List[String]               │
╞══════════════════════════╪════════════════════════════╡
│ daft.distributed...query ┆ [daft, distributed, query] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a.....b.c                ┆ [a, b, c]                  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.2...3..                ┆ [1, 2, 3, ]                │
╰──────────────────────────┴────────────────────────────╯
(Showing first 3 of 3 rows)
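
The last row above ends with an empty string because the input `1.2...3..` ends with a match of the pattern, which leaves an empty piece after the final delimiter. As a rough illustration, the same splits can be reproduced with Python's standard `re` module. This is a stand-in for the example, not Daft's implementation, and the filtering step is an optional post-processing choice rather than part of `regexp_split`.

```python
import re

# Same pattern and inputs as the Daft example above.
pattern = re.compile(r"\.+")
data = ["daft.distributed...query", "a.....b.c", "1.2...3.."]

# Split each string at every run of one or more dots.
splits = [pattern.split(s) for s in data]

# Optional: drop empty pieces left by leading or trailing delimiters.
cleaned = [[piece for piece in parts if piece] for parts in splits]

for original, parts in zip(data, splits):
    print(original, "->", parts)
```

If trailing empty strings are unwanted, filtering them out afterwards (as `cleaned` does here) is one option; stripping the delimiter from the ends of the string before splitting is another.
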
Source code in daft/functions/str.py
def regexp_split(expr: Expression, pattern: str | Expression) -> Expression:
    r"""Splits each string on the given regex pattern, into a list of strings.

    Args:
        expr: The expression to split.
        pattern: The pattern on which each string should be split, or a column to pick such patterns from.

    Returns:
        Expression: A List[String] expression containing the string splits for each string in the column.

    Examples:
        >>> import daft
        >>> from daft.functions import regexp_split
        >>>
        >>> df = daft.from_pydict({"data": ["daft.distributed...query", "a.....b.c", "1.2...3.."]})
        >>> df.with_column("split", regexp_split(df["data"], r"\.+")).collect()
        ╭──────────────────────────┬────────────────────────────╮
        │ data                     ┆ split                      │
        │ ---                      ┆ ---                        │
        │ String                   ┆ List[String]               │
        ╞══════════════════════════╪════════════════════════════╡
        │ daft.distributed...query ┆ [daft, distributed, query] │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ a.....b.c                ┆ [a, b, c]                  │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 1.2...3..                ┆ [1, 2, 3, ]                │
        ╰──────────────────────────┴────────────────────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return Expression._call_builtin_scalar_fn("regexp_split", expr, pattern)