
daft.functions.damerau_levenshtein_distance

damerau_levenshtein_distance(left: Expression, right: Expression) -> Expression

Compute the Damerau-Levenshtein distance between two strings.

This extends the Levenshtein distance by also counting transpositions of two adjacent characters as a single edit operation (in addition to insertions, deletions, and substitutions).
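
To make the effect of the extra transposition operation concrete, here is a minimal pure-Python sketch. It is not Daft's implementation: `edit_distance` and its `transpositions` flag are hypothetical names used only for illustration, and the transposition step follows the restricted (optimal string alignment) recurrence.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Illustrative edit distance (hypothetical helper, not part of Daft).

    With transpositions=False this is the plain Levenshtein distance.
    With transpositions=True a swap of two adjacent characters costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(edit_distance("abc", "bac"))                       # 2: two substitutions
print(edit_distance("abc", "bac", transpositions=True))  # 1: one adjacent swap
```

Swapping "ab" to "ba" costs two substitutions under plain Levenshtein but a single edit once transpositions are allowed.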

Note

This computes the Optimal String Alignment (OSA) variant, which does not allow a substring to be edited more than once. Results may differ from the true Damerau-Levenshtein distance for inputs with overlapping transpositions (e.g., "CA" to "ABC" is 3 under OSA but 2 under true Damerau-Levenshtein). OSA does not satisfy the triangle inequality.
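
The difference between the two variants can be checked with a short pure-Python sketch. Both functions below are illustrative textbook recurrences with hypothetical names (`osa_distance`, `true_damerau_levenshtein`); they are not Daft's internal code and are only meant to reproduce the "CA" to "ABC" case described in the note.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment distance (illustrative sketch)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # Adjacent swap; the swapped pair cannot be edited again.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def true_damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    big = len(a) + len(b)  # larger than any real distance
    last_row = {}          # last row in which each character of a was seen
    d = [[0] * (len(b) + 2) for _ in range(len(a) + 2)]
    d[0][0] = big
    for i in range(len(a) + 1):
        d[i + 1][0] = big
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[0][j + 1] = big
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                           # substitution or match
                d[i + 1][j] + 1,                          # insertion
                d[i][j + 1] + 1,                          # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),  # transposition across a gap
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(osa_distance("CA", "ABC"))              # 3
print(true_damerau_levenshtein("CA", "ABC"))  # 2
# Triangle inequality fails for OSA: going through "AC" costs 1 + 1 = 2 < 3.
print(osa_distance("CA", "AC") + osa_distance("AC", "ABC"))  # 2
```

The unrestricted variant reaches 2 by transposing "CA" to "AC" and then inserting "B" between the swapped characters, which OSA forbids because that substring has already been edited.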

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| left | Expression | The left string expression to compare. | required |
| right | Expression | The right string expression to compare against. | required |

Returns:

| Type | Description |
| --- | --- |
| Expression | The Damerau-Levenshtein (OSA) distance for each pair of strings. Returns null when either input is null. |

Examples:

>>> import daft
>>> from daft.functions import damerau_levenshtein_distance
>>> df = daft.from_pydict({"x": ["abc", "abc", ""], "y": ["bac", "acb", "abc"]})
>>> df = df.with_column("distance", damerau_levenshtein_distance(df["x"], df["y"]))
>>> df.collect()
╭────────┬────────┬──────────╮
│ x      ┆ y      ┆ distance │
│ ---    ┆ ---    ┆ ---      │
│ String ┆ String ┆ Int64    │
╞════════╪════════╪══════════╡
│ abc    ┆ bac    ┆ 1        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ abc    ┆ acb    ┆ 1        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│        ┆ abc    ┆ 3        │
╰────────┴────────┴──────────╯
(Showing first 3 of 3 rows)

Source code in daft/functions/str.py
def damerau_levenshtein_distance(left: Expression, right: Expression) -> Expression:
    """Compute the Damerau-Levenshtein distance between two strings.

    This extends the Levenshtein distance by also counting transpositions of two
    adjacent characters as a single edit operation (in addition to insertions,
    deletions, and substitutions).

    Note:
        This computes the Optimal String Alignment (OSA) variant, which does not
        allow a substring to be edited more than once. Results may differ from the
        true Damerau-Levenshtein distance for inputs with overlapping transpositions
        (e.g., ``"CA"`` to ``"ABC"`` is 3 under OSA but 2 under true
        Damerau-Levenshtein). OSA does not satisfy the triangle inequality.

    Args:
        left: The left string expression to compare.
        right: The right string expression to compare against.

    Returns:
        The Damerau-Levenshtein (OSA) distance for each pair of strings. Returns null
        when either input is null.

    Examples:
        >>> import daft
        >>> from daft.functions import damerau_levenshtein_distance
        >>> df = daft.from_pydict({"x": ["abc", "abc", ""], "y": ["bac", "acb", "abc"]})
        >>> df = df.with_column("distance", damerau_levenshtein_distance(df["x"], df["y"]))
        >>> df.collect()
        ╭────────┬────────┬──────────╮
        │ x      ┆ y      ┆ distance │
        │ ---    ┆ ---    ┆ ---      │
        │ String ┆ String ┆ Int64    │
        ╞════════╪════════╪══════════╡
        │ abc    ┆ bac    ┆ 1        │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
        │ abc    ┆ acb    ┆ 1        │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
        │        ┆ abc    ┆ 3        │
        ╰────────┴────────┴──────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return Expression._call_builtin_scalar_fn("damerau_levenshtein_distance", left, right)