daft.functions.regexp_extract_all

regexp_extract_all(expr: Expression, pattern: str | Expression, index: int = 0) -> Expression

Extracts the specified match group from all regex matches in each string in a string column.

Parameters:

Name     Type              Description                                    Default
expr     Expression        String expression to extract from              required
pattern  str | Expression  The regex pattern to extract                   required
index    int               The index of the regex match group to extract  0

Returns:

Name        Type        Description
Expression  Expression  A List[String] expression with the extracted regex matches

Note

This expression always returns a list of strings. If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, an empty list is returned.
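
The per-string behavior described in this note can be approximated in plain Python. The sketch below is not Daft's implementation: `extract_all` is a hypothetical helper built on the standard `re` module, whose regex syntax may differ from the engine Daft uses. It also assumes that a negative index and a group that takes no part in a match are treated as "group does not exist", which the note does not specify.

```python
import re


def extract_all(text: str, pattern: str, index: int = 0) -> list[str]:
    """Hypothetical stand-in for what regexp_extract_all does to one string."""
    compiled = re.compile(pattern)
    # Assumption: an index outside the pattern's groups yields an empty list,
    # mirroring "the group does not exist" in the note.
    if index < 0 or index > compiled.groups:
        return []
    results = []
    for match in compiled.finditer(text):
        value = match.group(index)  # index 0 is the entire match
        # Assumption: a group that did not participate in this match is skipped.
        if value is not None:
            results.append(value)
    return results


print(extract_all("123-456", r"(\d)(\d*)"))     # ['123', '456']
print(extract_all("123-456", r"(\d)(\d*)", 1))  # ['1', '4']
print(extract_all("abc", r"(\d)(\d*)"))         # []
print(extract_all("123-456", r"(\d)(\d*)", 5))  # []
```

The last two calls illustrate the two empty-list cases from the note: a pattern that never matches, and a group index the pattern does not define.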

Examples:

>>> import daft
>>> from daft.functions import regexp_extract_all
>>>
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("match", regexp_extract_all(df["x"], regex)).collect()
╭─────────┬──────────────╮
│ x       ┆ match        │
│ ---     ┆ ---          │
│ String  ┆ List[String] │
╞═════════╪══════════════╡
│ 123-456 ┆ [123, 456]   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ [789, 012]   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ [345, 678]   │
╰─────────┴──────────────╯
(Showing first 3 of 3 rows)

Extract the first capture group

>>> df.with_column("match", regexp_extract_all(df["x"], regex, 1)).collect()
╭─────────┬──────────────╮
│ x       ┆ match        │
│ ---     ┆ ---          │
│ String  ┆ List[String] │
╞═════════╪══════════════╡
│ 123-456 ┆ [1, 4]       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ [7, 0]       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ [3, 6]       │
╰─────────┴──────────────╯
(Showing first 3 of 3 rows)
See Also

regexp_extract

Source code in daft/functions/str.py
def regexp_extract_all(expr: Expression, pattern: str | Expression, index: int = 0) -> Expression:
    r"""Extracts the specified match group from all regex matches in each string in a string column.

    Args:
        expr: String expression to extract from
        pattern: The regex pattern to extract
        index: The index of the regex match group to extract

    Returns:
        Expression: a List[String] expression with the extracted regex matches

    Note:
        This expression always returns a list of strings.
        If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, an empty list is returned.

    Examples:
        >>> import daft
        >>> from daft.functions import regexp_extract_all
        >>>
        >>> regex = r"(\d)(\d*)"
        >>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
        >>> df.with_column("match", regexp_extract_all(df["x"], regex)).collect()
        ╭─────────┬──────────────╮
        │ x       ┆ match        │
        │ ---     ┆ ---          │
        │ String  ┆ List[String] │
        ╞═════════╪══════════════╡
        │ 123-456 ┆ [123, 456]   │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 789-012 ┆ [789, 012]   │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 345-678 ┆ [345, 678]   │
        ╰─────────┴──────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        Extract the first capture group

        >>> df.with_column("match", regexp_extract_all(df["x"], regex, 1)).collect()
        ╭─────────┬──────────────╮
        │ x       ┆ match        │
        │ ---     ┆ ---          │
        │ String  ┆ List[String] │
        ╞═════════╪══════════════╡
        │ 123-456 ┆ [1, 4]       │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 789-012 ┆ [7, 0]       │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 345-678 ┆ [3, 6]       │
        ╰─────────┴──────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    Tip: See Also
        [`regexp_extract`](https://docs.daft.ai/en/stable/api/functions/regexp_extract/)
    """
    return Expression._call_builtin_scalar_fn("regexp_extract_all", expr, pattern, index)