Calculates the approximate percentile(s) for a column of numeric values.
For numeric columns, we use the sketches_ddsketch crate, a Rust implementation of the algorithm from the paper "DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees" (Masson et al.).
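To make the phrase "relative-error guarantee" concrete, here is a minimal pure-Python sketch of the log-bucketing idea from the DDSketch paper. It is not the crate's implementation. The class name `TinyDDSketch`, the accuracy parameter `alpha = 0.01`, the rank convention in `quantile`, and the restriction to positive values are all assumptions made for illustration.

```python
from __future__ import annotations

import math
from collections import Counter


class TinyDDSketch:
    """Illustrative log-bucketed quantile sketch (positive values only)."""

    def __init__(self, alpha: float = 0.01) -> None:
        # alpha is the target relative error (assumed value, for illustration);
        # gamma is the growth factor between consecutive bucket boundaries.
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()  # bucket index -> number of values seen
        self.count = 0

    def add(self, value: float) -> None:
        if value <= 0:
            raise ValueError("this simplified sketch only handles positive values")
        # Bucket k covers the interval (gamma**(k - 1), gamma**k].
        key = math.ceil(math.log(value) / math.log(self.gamma))
        self.buckets[key] += 1
        self.count += 1

    def merge(self, other: TinyDDSketch) -> None:
        # Sketches built with the same alpha merge by adding bucket counts.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q: float) -> float | None:
        if self.count == 0:
            return None
        rank = q * (self.count - 1)
        seen = 0
        for key in sorted(self.buckets):
            seen += self.buckets[key]
            if seen > rank:
                # This representative is within alpha relative error of
                # every value that can fall into bucket `key`.
                return 2 * self.gamma**key / (self.gamma + 1)
        return None  # not reached for 0 <= q <= 1


sketch = TinyDDSketch()
for v in [1, 2, 3, 4, 5]:
    sketch.add(v)

estimate = sketch.quantile(0.5)
# Check that the estimate is within alpha of the true median, 3
# (a tiny tolerance absorbs floating-point rounding).
print(abs(estimate - 3) / 3 <= sketch.alpha + 1e-9)
```

Only bucket counts are stored, so memory does not grow with the number of rows, and two sketches can be combined without revisiting the data. The price is that results are close to the true percentiles but not exactly equal to them.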
- Null values are ignored in the computation of the percentiles
- If all values are Null then the result will also be Null
- If percentiles is supplied as a single float, then the resultant column is a Float64 column
- If percentiles is supplied as a list, then the resultant column is a FixedSizeList[Float64; N] column, where N is the length of the supplied list.
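The null-handling and result-shape rules above can be modelled in plain Python. This is only a sketch of the documented behaviour: `percentiles_model` is a hypothetical helper, it uses an exact lookup on sorted values as a stand-in for the approximate sketch query, and treating an empty input like an all-null input is an assumption of the model.

```python
from __future__ import annotations


def percentiles_model(values: list[float | None], percentiles: float | list[float]):
    """Hypothetical model of the documented null handling and result shape."""
    # Null values are ignored.
    present = sorted(v for v in values if v is not None)

    # All-null input gives a null result (empty input too, by assumption).
    if not present:
        return None

    def one(q: float) -> float:
        # Exact lookup standing in for the approximate query.
        return float(present[int(q * (len(present) - 1))])

    # A single float gives a single Float64-like value.
    if isinstance(percentiles, (int, float)):
        return one(percentiles)

    # A list of N percentiles gives a list of N values.
    return [one(q) for q in percentiles]


single = percentiles_model([1, 2, 3, 4, 5, None], 0.5)
several = percentiles_model([1, 2, 3, 4, 5, None], [0.25, 0.5, 0.75])
all_null = percentiles_model([None], [0.25, 0.5, 0.75])
```

Here `single` is one float, `several` is a list with one entry per requested percentile, and `all_null` is `None` regardless of how many percentiles were requested.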
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| percentiles | float \| list[float] | The percentile(s) at which to find approximate values (for example, 0.5 for the median). Can be provided as a single float or a list of floats. | required |
Returns:
| Type | Description |
| --- | --- |
| Expression | A new expression representing the approximate percentile(s). If percentiles was a single float, this will be a new Float64 expression. If percentiles was a list of floats, this will be a new expression with type FixedSizeList[Float64; len(percentiles)]. |
Examples:
A global calculation of approximate percentiles:
>>> import daft
>>> from daft.functions import approx_percentiles
>>>
>>> df = daft.from_pydict({"scores": [1, 2, 3, 4, 5, None]})
>>> df = df.agg(
...     approx_percentiles(df["scores"], 0.5).alias("approx_median_score"),
...     approx_percentiles(df["scores"], [0.25, 0.5, 0.75]).alias("approx_percentiles_scores"),
... )
>>> df.show()
╭─────────────────────┬────────────────────────────────╮
│ approx_median_score ┆ approx_percentiles_scores      │
│ ---                 ┆ ---                            │
│ Float64             ┆ List[Float64; 3]               │
╞═════════════════════╪════════════════════════════════╡
│ 2.9742334234767167  ┆ [1.993661701417351, 2.9742334… │
╰─────────────────────┴────────────────────────────────╯

(Showing first 1 of 1 rows)
A grouped calculation of approximate percentiles:
>>> df = daft.from_pydict({"class": ["a", "a", "a", "b", "c"], "scores": [1, 2, 3, 1, None]})
>>> df = (
...     df.groupby("class")
...     .agg(
...         approx_percentiles(df["scores"], 0.5).alias("approx_median_score"),
...         approx_percentiles(df["scores"], [0.25, 0.5, 0.75]).alias("approx_percentiles_scores"),
...     )
...     .sort("class")
... )
>>> df.show()
╭────────┬─────────────────────┬────────────────────────────────╮
│ class  ┆ approx_median_score ┆ approx_percentiles_scores      │
│ ---    ┆ ---                 ┆ ---                            │
│ String ┆ Float64             ┆ List[Float64; 3]               │
╞════════╪═════════════════════╪════════════════════════════════╡
│ a      ┆ 1.993661701417351   ┆ [0.9900000000000001, 1.993661… │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b      ┆ 0.9900000000000001  ┆ [0.9900000000000001, 0.990000… │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c      ┆ None                ┆ None                           │
╰────────┴─────────────────────┴────────────────────────────────╯

(Showing first 3 of 3 rows)
Source code in daft/functions/agg.py
def approx_percentiles(expr: Expression, percentiles: float | list[float]) -> Expression:
    """Calculates the approximate percentile(s) for a column of numeric values.

    For numeric columns, we use the [sketches_ddsketch crate](https://docs.rs/sketches-ddsketch/latest/sketches_ddsketch/index.html).
    This is a Rust implementation of the paper [DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees (Masson et al.)](https://arxiv.org/pdf/1908.10693)

    1. Null values are ignored in the computation of the percentiles
    2. If all values are Null then the result will also be Null
    3. If ``percentiles`` is supplied as a single float, then the resultant column is a ``Float64`` column
    4. If ``percentiles`` is supplied as a list, then the resultant column is a ``FixedSizeList[Float64; N]`` column, where ``N`` is the length of the supplied list.

    Args:
        percentiles: the percentile(s) at which to find approximate values. Can be provided as a single
            float or a list of floats.

    Returns:
        A new expression representing the approximate percentile(s). If `percentiles` was a single float, this will be a new `Float64` expression. If `percentiles` was a list of floats, this will be a new expression with type: `FixedSizeList[Float64; len(percentiles)]`.

    Examples:
        A global calculation of approximate percentiles:

        >>> import daft
        >>> from daft.functions import approx_percentiles
        >>>
        >>> df = daft.from_pydict({"scores": [1, 2, 3, 4, 5, None]})
        >>> df = df.agg(
        ...     approx_percentiles(df["scores"], 0.5).alias("approx_median_score"),
        ...     approx_percentiles(df["scores"], [0.25, 0.5, 0.75]).alias("approx_percentiles_scores"),
        ... )
        >>> df.show()
        ╭─────────────────────┬────────────────────────────────╮
        │ approx_median_score ┆ approx_percentiles_scores      │
        │ ---                 ┆ ---                            │
        │ Float64             ┆ List[Float64; 3]               │
        ╞═════════════════════╪════════════════════════════════╡
        │ 2.9742334234767167  ┆ [1.993661701417351, 2.9742334… │
        ╰─────────────────────┴────────────────────────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

        A grouped calculation of approximate percentiles:

        >>> df = daft.from_pydict({"class": ["a", "a", "a", "b", "c"], "scores": [1, 2, 3, 1, None]})
        >>> df = (
        ...     df.groupby("class")
        ...     .agg(
        ...         approx_percentiles(df["scores"], 0.5).alias("approx_median_score"),
        ...         approx_percentiles(df["scores"], [0.25, 0.5, 0.75]).alias("approx_percentiles_scores"),
        ...     )
        ...     .sort("class")
        ... )
        >>> df.show()
        ╭────────┬─────────────────────┬────────────────────────────────╮
        │ class  ┆ approx_median_score ┆ approx_percentiles_scores      │
        │ ---    ┆ ---                 ┆ ---                            │
        │ String ┆ Float64             ┆ List[Float64; 3]               │
        ╞════════╪═════════════════════╪════════════════════════════════╡
        │ a      ┆ 1.993661701417351   ┆ [0.9900000000000001, 1.993661… │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ b      ┆ 0.9900000000000001  ┆ [0.9900000000000001, 0.990000… │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ c      ┆ None                ┆ None                           │
        ╰────────┴─────────────────────┴────────────────────────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return Expression._from_pyexpr(expr._expr.approx_percentiles(percentiles))