daft.functions.regexp_extract

regexp_extract

regexp_extract(expr: Expression, pattern: str | Expression, index: int = 0) -> Expression

Extracts the specified match group from the first regex match in each string in a string column.

Parameters:

Name     Type              Description                                    Default
expr     Expression        String expression to extract from              required
pattern  str | Expression  The regex pattern to extract                   required
index    int               The index of the regex match group to extract  0

Returns:

Name        Type        Description
Expression  Expression  a String expression with the extracted regex match

Note

If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, a null value is returned.
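The rules in the note can be illustrated with a small plain-Python analogy. This is only a sketch built on the standard `re` module. It is not how Daft implements the function, and `extract_group` is a hypothetical helper name introduced here for illustration. Regex syntax in Daft's engine may differ from Python's in places, though the simple pattern used here should behave the same way in both.

```python
import re
from typing import Optional


def extract_group(text: str, pattern: str, index: int = 0) -> Optional[str]:
    """Hypothetical helper mirroring the documented semantics (not part of Daft)."""
    compiled = re.compile(pattern)
    # A group index outside the pattern's groups yields None (null).
    if index < 0 or index > compiled.groups:
        return None
    # Only the first match in the string is considered.
    match = compiled.search(text)
    if match is None:
        return None  # pattern did not match
    # Index 0 is the entire match; 1..n are the capture groups.
    return match.group(index)


regex = r"(\d)(\d*)"
print(extract_group("123-456", regex))      # 123  (entire first match)
print(extract_group("123-456", regex, 1))   # 1    (first capture group)
print(extract_group("123-456", regex, 2))   # 23   (second capture group)
print(extract_group("no digits", regex))    # None (no match)
print(extract_group("123-456", regex, 5))   # None (group does not exist)
```

Note that only the first match (`123`) is used. The later `456` is never inspected, which is where `regexp_extract_all` would differ.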

Examples:

>>> import daft
>>> from daft.functions import regexp_extract
>>>
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("match", regexp_extract(df["x"], regex)).collect()
╭─────────┬────────╮
│ x       ┆ match  │
│ ---     ┆ ---    │
│ String  ┆ String │
╞═════════╪════════╡
│ 123-456 ┆ 123    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ 789    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ 345    │
╰─────────┴────────╯
(Showing first 3 of 3 rows)

Extract the first capture group

>>> df.with_column("match", regexp_extract(df["x"], regex, 1)).collect()
╭─────────┬────────╮
│ x       ┆ match  │
│ ---     ┆ ---    │
│ String  ┆ String │
╞═════════╪════════╡
│ 123-456 ┆ 1      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ 7      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ 3      │
╰─────────┴────────╯
(Showing first 3 of 3 rows)
See Also

regexp_extract_all

Source code in daft/functions/str.py
def regexp_extract(expr: Expression, pattern: str | Expression, index: int = 0) -> Expression:
    r"""Extracts the specified match group from the first regex match in each string in a string column.

    Args:
        expr: String expression to extract from
        pattern: The regex pattern to extract
        index: The index of the regex match group to extract

    Returns:
        Expression: a String expression with the extracted regex match

    Note:
        If index is 0, the entire match is returned.
        If the pattern does not match or the group does not exist, a null value is returned.

    Examples:
        >>> import daft
        >>> from daft.functions import regexp_extract
        >>>
        >>> regex = r"(\d)(\d*)"
        >>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
        >>> df.with_column("match", regexp_extract(df["x"], regex)).collect()
        ╭─────────┬────────╮
        │ x       ┆ match  │
        │ ---     ┆ ---    │
        │ String  ┆ String │
        ╞═════════╪════════╡
        │ 123-456 ┆ 123    │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 789-012 ┆ 789    │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 345-678 ┆ 345    │
        ╰─────────┴────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        Extract the first capture group

        >>> df.with_column("match", regexp_extract(df["x"], regex, 1)).collect()
        ╭─────────┬────────╮
        │ x       ┆ match  │
        │ ---     ┆ ---    │
        │ String  ┆ String │
        ╞═════════╪════════╡
        │ 123-456 ┆ 1      │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 789-012 ┆ 7      │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 345-678 ┆ 3      │
        ╰─────────┴────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)


    Tip: See Also
        [`regexp_extract_all`](https://docs.daft.ai/en/stable/api/functions/regexp_extract_all/)
    """
    return Expression._call_builtin_scalar_fn("regexp_extract", expr, pattern, index)