Aggregations
Daft lets you group data by one or more keys and compute aggregations such as sum, mean, and count within each group.
Calling df.groupby() returns a GroupedDataFrame object, which is a view of the original DataFrame with additional context on which keys to group on. You can then call an aggregation method on it to run the aggregation within each group; each method returns a new DataFrame.
GroupedDataFrame
GroupedDataFrame(df: DataFrame, group_by: ExpressionsProjection)
Methods:
| Name | Description |
|---|---|
| agg | Perform aggregations on this GroupedDataFrame. Allows for mixed aggregations. |
| any_value | Returns an arbitrary value on this GroupedDataFrame. |
| count | Performs grouped count on this GroupedDataFrame. |
| count_distinct | Performs grouped count of distinct values on this GroupedDataFrame. |
| list_agg | Performs grouped list on this GroupedDataFrame. |
| list_agg_distinct | Performs grouped list distinct on this GroupedDataFrame (ignoring nulls). |
| map_groups | Apply a user-defined function to each group. The name of the resultant column will default to the name of the first input column. |
| max | Performs grouped max on this GroupedDataFrame. |
| mean | Performs grouped mean on this GroupedDataFrame. |
| min | Perform grouped min on this GroupedDataFrame. |
| product | Performs grouped product on this GroupedDataFrame. |
| skew | Performs grouped skew on this GroupedDataFrame. |
| stddev | Performs grouped standard deviation on this GroupedDataFrame. |
| string_agg | Performs grouped string concat on this GroupedDataFrame. |
| sum | Perform grouped sum on this GroupedDataFrame. |
| var | Performs grouped variance on this GroupedDataFrame. |
Attributes:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | The original DataFrame being grouped. |
| group_by | ExpressionsProjection | The key expressions to group on. |
agg
agg(*to_agg: Expression | Iterable[Expression]) -> DataFrame
Perform aggregations on this GroupedDataFrame. Allows for mixed aggregations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *to_agg | Union[Expression, Iterable[Expression]] | Aggregation expressions. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped aggregations. |
Examples:
╭────────┬─────────┬─────────┬────────┬────────╮
│ pet ┆ min_age ┆ max_age ┆ count ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ String ┆ Int64 ┆ Int64 ┆ UInt64 ┆ String │
╞════════╪═════════╪═════════╪════════╪════════╡
│ cat ┆ 1 ┆ 4 ┆ 2 ┆ Alex │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ dog ┆ 2 ┆ 3 ┆ 2 ┆ Jordan │
╰────────┴─────────┴─────────┴────────┴────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
any_value
any_value(*cols: ColumnInputType) -> DataFrame
Returns an arbitrary value on this GroupedDataFrame.
Values for each column are not guaranteed to be from the same row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to take an arbitrary value from. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with any values. |
Source code in daft/dataframe/dataframe.py
count
count(*cols: ColumnInputType) -> DataFrame
Performs grouped count on this GroupedDataFrame.
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped count per column. |
Source code in daft/dataframe/dataframe.py
count_distinct
count_distinct(*cols: ColumnInputType) -> DataFrame
Performs grouped count of distinct values on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns whose distinct values are counted. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped count of distinct values per column. |
Examples:
╭────────┬────────╮
│ keys ┆ vals │
│ --- ┆ --- │
│ String ┆ UInt64 │
╞════════╪════════╡
│ a ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
╰────────┴────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
list_agg
list_agg(*cols: ColumnInputType) -> DataFrame
Performs grouped list on this GroupedDataFrame.
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped list per column. |
Source code in daft/dataframe/dataframe.py
list_agg_distinct
list_agg_distinct(*cols: ColumnInputType) -> DataFrame
Performs grouped list distinct on this GroupedDataFrame (ignoring nulls).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns whose distinct values are collected into a list. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped list distinct per column. |
Source code in daft/dataframe/dataframe.py
map_groups
map_groups(udf: Expression) -> DataFrame
Apply a user-defined function to each group. The name of the resultant column will default to the name of the first input column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| udf | Expression | User-defined function to apply to each group. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped aggregations. |
Examples:
╭────────┬────────────────────╮
│ group ┆ data │
│ --- ┆ --- │
│ String ┆ Float64 │
╞════════╪════════════════════╡
│ a ┆ 14.730919862656235 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 331.62026476076517 │
╰────────┴────────────────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
max
max(*cols: ColumnInputType) -> DataFrame
Performs grouped max on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to compute the maximum of. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped max. |
Source code in daft/dataframe/dataframe.py
mean
mean(*cols: ColumnInputType) -> DataFrame
Performs grouped mean on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to compute the mean of. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped mean. |
Source code in daft/dataframe/dataframe.py
min
min(*cols: ColumnInputType) -> DataFrame
Perform grouped min on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to compute the minimum of. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped min. |
Source code in daft/dataframe/dataframe.py
product
product(*cols: ColumnInputType) -> DataFrame
Performs grouped product on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to compute the product of. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped products. |
Examples:
╭────────┬───────╮
│ keys ┆ col_a │
│ --- ┆ --- │
│ String ┆ Int64 │
╞════════╪═══════╡
│ a ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 100 │
╰────────┴───────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
skew
skew(*cols: ColumnInputType) -> DataFrame
Performs grouped skew on this GroupedDataFrame.
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with the grouped skew per column. |
Source code in daft/dataframe/dataframe.py
stddev
stddev(*cols: ColumnInputType, ddof: int = 1) -> DataFrame
Performs grouped standard deviation on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to compute the standard deviation of. | () |
| ddof | int | Delta degrees of freedom used in the denominator. | 1 |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped standard deviation. |
Examples:
╭────────┬─────────╮
│ keys ┆ col_a │
│ --- ┆ --- │
│ String ┆ Float64 │
╞════════╪═════════╡
│ a ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b ┆ None │
╰────────┴─────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
string_agg
string_agg(*cols: ColumnInputType, delimiter: str | None = None) -> DataFrame
Performs grouped string concat on this GroupedDataFrame.
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped string concatenated per column. |
Source code in daft/dataframe/dataframe.py
sum
sum(*cols: ColumnInputType) -> DataFrame
Perform grouped sum on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to sum. | () |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped sums. |
Source code in daft/dataframe/dataframe.py
var
var(*cols: ColumnInputType, ddof: int = 1) -> DataFrame
Performs grouped variance on this GroupedDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *cols | Union[str, Expression] | Columns to compute variance for. | () |
| ddof | int | Delta degrees of freedom used in the denominator. | 1 |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | DataFrame with grouped variance. |
Examples:
╭────────┬─────────╮
│ keys ┆ col_a │
│ --- ┆ --- │
│ String ┆ Float64 │
╞════════╪═════════╡
│ a ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b ┆ None │
╰────────┴─────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py