daft.functions.regexp_extract

regexp_extract

regexp_extract(expr: Expression, pattern: str | Expression, index: int = 0) -> Expression

Extracts the specified match group from the first regex match in each string in a string column.

Parameters:

Name     Type              Description                                    Default
expr     Expression        String expression to extract from              required
pattern  str | Expression  The regex pattern to extract                   required
index    int               The index of the regex match group to extract  0

Returns:

Name        Type        Description
Expression  Expression  a String expression with the extracted regex match

Note

If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, a null value is returned.

Examples:

>>> import daft
>>> from daft.functions import regexp_extract
>>>
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("match", regexp_extract(df["x"], regex)).collect()
╭─────────┬────────╮
│ x       ┆ match  │
│ ---     ┆ ---    │
│ String  ┆ String │
╞═════════╪════════╡
│ 123-456 ┆ 123    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ 789    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ 345    │
╰─────────┴────────╯
(Showing first 3 of 3 rows)

Extract the first capture group

>>> df.with_column("match", regexp_extract(df["x"], regex, 1)).collect()
╭─────────┬────────╮
│ x       ┆ match  │
│ ---     ┆ ---    │
│ String  ┆ String │
╞═════════╪════════╡
│ 123-456 ┆ 1      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ 7      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ 3      │
╰─────────┴────────╯
(Showing first 3 of 3 rows)
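
The null behavior described in the Note can be easier to see in a small reference model. The sketch below is not Daft code: `regexp_extract_model` is a hypothetical helper built on Python's standard `re` module, and it assumes that Daft's regex engine agrees with `re` for this simple pattern and that a null input produces a null output. It only illustrates the documented rules: the first match is used, index 0 returns the entire match, and a failed match or a missing group yields null (`None` here).

```python
import re
from typing import Optional


def regexp_extract_model(text: Optional[str], pattern: str, index: int = 0) -> Optional[str]:
    """Hypothetical pure-Python model of the documented semantics (not Daft code)."""
    if text is None:
        return None  # assumption: a null input yields a null output
    match = re.search(pattern, text)  # only the first match is considered
    if match is None:
        return None  # the pattern does not match
    if index < 0 or index > match.re.groups:
        return None  # the requested group does not exist in the pattern
    return match.group(index)  # index 0 is the entire match


regex = r"(\d)(\d*)"
rows = ["123-456", "no digits", None]

whole = [regexp_extract_model(s, regex) for s in rows]       # entire match
second = [regexp_extract_model(s, regex, 2) for s in rows]   # second capture group
missing = [regexp_extract_model(s, regex, 3) for s in rows]  # group 3 is not defined

print(whole)    # ['123', None, None]
print(second)   # ['23', None, None]
print(missing)  # [None, None, None]
```

With the pattern `(\d)(\d*)`, group 1 is the first digit and group 2 is the rest of the digit run, so index 2 on `"123-456"` gives `"23"` in this model. Asking for group 3 returns `None` for every row because the pattern defines only two groups.
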
See Also

regexp_extract_all

Source code in daft/functions/str.py
def regexp_extract(expr: Expression, pattern: str | Expression, index: int = 0) -> Expression:
    r"""Extracts the specified match group from the first regex match in each string in a string column.

    Args:
        expr: String expression to extract from
        pattern: The regex pattern to extract
        index: The index of the regex match group to extract

    Returns:
        Expression: a String expression with the extracted regex match

    Note:
        If index is 0, the entire match is returned.
        If the pattern does not match or the group does not exist, a null value is returned.

    Examples:
        >>> import daft
        >>> from daft.functions import regexp_extract
        >>>
        >>> regex = r"(\d)(\d*)"
        >>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
        >>> df.with_column("match", regexp_extract(df["x"], regex)).collect()
        ╭─────────┬────────╮
        │ x       ┆ match  │
        │ ---     ┆ ---    │
        │ String  ┆ String │
        ╞═════════╪════════╡
        │ 123-456 ┆ 123    │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 789-012 ┆ 789    │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 345-678 ┆ 345    │
        ╰─────────┴────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        Extract the first capture group

        >>> df.with_column("match", regexp_extract(df["x"], regex, 1)).collect()
        ╭─────────┬────────╮
        │ x       ┆ match  │
        │ ---     ┆ ---    │
        │ String  ┆ String │
        ╞═════════╪════════╡
        │ 123-456 ┆ 1      │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 789-012 ┆ 7      │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 345-678 ┆ 3      │
        ╰─────────┴────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)


    Tip: See Also
        [`regexp_extract_all`](https://docs.daft.ai/en/stable/api/functions/regexp_extract_all/)
    """
    return Expression._call_builtin_scalar_fn("regexp_extract", expr, pattern, index)