daft.functions

Built-in Daft Functions

AI Functions

classify_image(image, labels[, provider, model], **options) Returns an expression that classifies images using the specified model and provider.
classify_text(text, labels[, provider, model], **options) Returns an expression that classifies text using the specified model and provider.
embed_image(image[, provider, model], **options) Returns an expression that embeds images using the specified image model and provider.
embed_text(text[, provider, model, dimensions], **options) Returns an expression that embeds text using the specified embedding model and provider.
prompt(messages[, return_format, system_message, provider, model], **options) Returns an expression that prompts a large language model using the specified model and provider.

Aggregate Functions

any_value(expr[, ignore_nulls]) Returns any non-null value from the expression.
approx_count_distinct(expr) Calculates the approximate number of non-`NULL` distinct values in the expression.
approx_percentiles(expr, percentiles) Calculates the approximate percentile(s) for a column of numeric values.
avg(expr) Calculates the mean of the values in the expression. Alias for mean().
bool_and(expr) Calculates the boolean AND of all values in the expression.
bool_or(expr) Calculates the boolean OR of all values in the expression.
count([expr, mode]) Counts the number of values in the expression.
count_distinct(expr) Counts the number of distinct values in the expression.
list_agg(expr) Aggregates the values in the expression into a list.
list_agg_distinct(expr) Aggregates the values in the expression into a list of distinct values (ignoring nulls).
max(expr) Calculates the maximum of the values in the expression.
mean(expr) Calculates the mean of the values in the expression.
median(expr) Calculates the median of the values in the expression.
min(expr) Calculates the minimum of the values in the expression.
percentile(expr, percentage) Calculates the exact percentile for a column of numeric values.
product(expr) Calculates the product of the values in the expression.
skew(expr) Calculates the skewness of the values from the expression.
stddev(expr[, ddof]) Calculates the standard deviation of the values in the expression.
string_agg(expr[, delimiter]) Aggregates the values in the expression into a single string by concatenating them.
sum(expr) Calculates the sum of the values in the expression.
var(expr[, ddof]) Calculates the variance of the values in the expression.
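
The `ddof` argument on `stddev` and `var` is the usual "delta degrees of freedom" adjustment to the divisor. The following is a minimal plain-Python sketch of that arithmetic, not Daft's implementation. The helper names `variance` and `std_dev` are hypothetical, and both the default `ddof=0` and the skipping of nulls are assumptions made for illustration.

```python
import math

def variance(values, ddof=0):
    """Variance with divisor (n - ddof); None entries are skipped (assumed)."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    if n - ddof <= 0:
        return None  # not enough data points for this ddof
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)

def std_dev(values, ddof=0):
    """Square root of the variance, or None when the variance is undefined."""
    var = variance(values, ddof)
    return None if var is None else math.sqrt(var)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
population = std_dev(data, ddof=0)  # divides by n
sample = std_dev(data, ddof=1)      # divides by n - 1
```

With `ddof=0` the result describes the data as a whole population; `ddof=1` gives the sample estimate, which is always slightly larger.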

Audio Functions

audio_metadata(file_expr) Get metadata for an audio file.
resample(file_expr, sample_rate) Resample an audio file.

Binary Functions

compress(expr, codec) Compress binary or string values using the specified codec.
decode(bytes, charset) Decodes binary values using the specified character set.
decompress(bytes, codec) Decompress binary values using the specified codec.
encode(expr, charset) Encode binary or string values using the specified character set.
try_compress(expr, codec) Compress or null if unsuccessful.
try_decode(bytes, charset) Decode or null if unsuccessful.
try_decompress(expr, codec) Decompress or null if unsuccessful.
try_encode(expr, charset) Encode or null if unsuccessful.
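
The `try_*` variants above differ from their strict counterparts only in how they handle failure: they yield null instead of raising. A rough plain-Python sketch of that contract follows. It uses Python's own `zlib` and `bytes.decode` as stand-ins; the codec and charset names Daft accepts are not listed here, so `"utf-8"` and zlib are assumptions, and the helpers are hypothetical.

```python
import zlib

def try_decode(data, charset="utf-8"):
    """Decode bytes to str, returning None instead of raising on bad input."""
    try:
        return data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return None

def try_decompress(data):
    """Decompress zlib data, returning None if the payload is not valid zlib."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None

# Round trip: encode -> compress -> decompress -> decode.
payload = zlib.compress("héllo".encode("utf-8"))
round_trip = try_decode(try_decompress(payload))

bad_text = try_decode(b"\xff\xfe\xfa")   # not valid UTF-8
bad_blob = try_decompress(b"not zlib")   # not a zlib stream
```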

Bitwise Functions

bitwise_and(left, right) Bitwise AND of two integer expressions.
bitwise_or(left, right) Bitwise OR of two integer expressions.
bitwise_xor(left, right) Bitwise XOR of two integer expressions.
shift_left(expr, num_bits) Shifts the bits of an integer expression to the left (`expr << num_bits`).
shift_right(expr, num_bits) Shifts the bits of an integer expression to the right (`expr >> num_bits`).

Columnar Functions

columns_avg(*exprs) Average values across columns. Akin to `columns_mean`.
columns_max(*exprs) Find the maximum value across columns.
columns_mean(*exprs) Average values across columns. Akin to `columns_avg`.
columns_min(*exprs) Find the minimum value across columns.
columns_sum(*exprs) Sum values across columns.
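
Unlike the aggregate functions, which collapse one column into a single value, the columnar functions above combine several columns row by row. A small plain-Python sketch of that distinction, using lists as stand-in columns; these helpers merely borrow the Daft names for readability and do not model null handling.

```python
def columns_sum(*columns):
    """Row-wise sum across equally long columns (plain lists here)."""
    return [sum(row) for row in zip(*columns)]

def columns_max(*columns):
    """Row-wise maximum across columns."""
    return [max(row) for row in zip(*columns)]

def columns_mean(*columns):
    """Row-wise mean across columns."""
    return [sum(row) / len(row) for row in zip(*columns)]

a = [1, 2, 3]
b = [10, 20, 30]
c = [100, 200, 300]

row_sums = columns_sum(a, b, c)    # one value per row
row_maxes = columns_max(a, b, c)
row_means = columns_mean(a, b, c)
col_total = sum(a)                 # an aggregate collapses a column instead
```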

Date and Time Functions

add_months(expr, months) Adds a number of months to a date or timestamp.
convert_time_zone(expr, to_timezone[, from_timezone]) Converts a timestamp to another timezone while preserving the instant in time.
current_date() Returns the current date (UTC).
current_timestamp() Returns the current timestamp (UTC) with microsecond precision.
current_timezone() Returns the current timezone as a string (always 'UTC' in Daft).
date(expr) Retrieves the date for a datetime column.
date_add(expr, days) Adds a number of days to a date.
date_diff(end, start) Returns the number of days between two dates.
date_from_unix_date(expr) Converts days since Unix epoch (1970-01-01) to a date.
date_sub(expr, days) Subtracts a number of days from a date.
date_trunc(interval, expr[, relative_to]) Truncates the datetime column to the specified interval.
day(expr) Retrieves the day for a datetime column.
day_of_month(expr) Retrieves the day of the month for a datetime column.
day_of_week(expr) Retrieves the day of the week for a datetime column, starting at 0 for Monday and ending at 6 for Sunday.
day_of_year(expr) Retrieves the ordinal day for a datetime column, starting at 1 for January 1st and ending at 365 or 366 for December 31st.
from_unixtime(expr[, format]) Converts a Unix timestamp (seconds) to a formatted string.
hour(expr) Retrieves the hour for a datetime column.
last_day(expr) Returns the last day of the month for the given date or timestamp.
make_date(year, month, day) Creates a date from year, month, and day integer components.
make_timestamp(year, month, day, hour, minute, second[, timezone]) Creates a timestamp from individual date/time components.
make_timestamp_ltz(year, month, day, hour, minute, second[, timezone]) Creates a UTC timestamp from individual date/time components.
microsecond(expr) Retrieves the microsecond for a datetime column.
millisecond(expr) Retrieves the millisecond for a datetime column.
minute(expr) Retrieves the minute for a datetime column.
month(expr) Retrieves the month for a datetime column.
months_between(end, start) Returns the number of months between two dates or timestamps.
nanosecond(expr) Retrieves the nanosecond for a datetime column.
next_day(expr, day_of_week) Returns the next occurrence of the specified day of the week after the given date.
quarter(expr) Retrieves the quarter for a datetime column.
replace_time_zone(expr[, timezone]) Replaces the timezone of a timestamp while preserving the local time.
second(expr) Retrieves the second for a datetime column.
strftime(expr[, format]) Converts a datetime/date column to a string column.
time(expr) Retrieves the time for a datetime column.
timestamp_micros(expr) Creates a timestamp from microseconds since Unix epoch.
timestamp_millis(expr) Creates a timestamp from milliseconds since Unix epoch.
timestamp_seconds(expr) Creates a timestamp from seconds since Unix epoch.
to_date(expr, format) Converts a string to a date using the specified format.
to_datetime(expr, format[, timezone]) Converts a string to a datetime using the specified format and timezone.
to_unix_epoch(expr[, time_unit]) Converts a datetime column to a Unix timestamp with the specified time unit (default: seconds).
total_days(expr) Calculates the total number of days for a duration column.
total_hours(expr) Calculates the total number of hours for a duration column.
total_microseconds(expr) Calculates the total number of microseconds for a duration column.
total_milliseconds(expr) Calculates the total number of milliseconds for a duration column.
total_minutes(expr) Calculates the total number of minutes for a duration column.
total_nanoseconds(expr) Calculates the total number of nanoseconds for a duration column.
total_seconds(expr) Calculates the total number of seconds for a duration column.
unix_date(expr) Retrieves the number of days since 1970-01-01 00:00:00 UTC.
week_of_year(expr) Retrieves the week of the year for a datetime column.
year(expr) Retrieves the year for a datetime column.
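
Two conventions in the list above are easy to mix up: the numbering used by `day_of_week` and `day_of_year`, and the difference between converting and replacing a time zone. The sketch below illustrates both with Python's standard `datetime` module only. It is not Daft code, and the fixed `+02:00` offset is an assumed stand-in for a named time zone.

```python
from datetime import date, datetime, timedelta, timezone

d = date(2024, 12, 31)  # a Tuesday in a leap year

day_of_week = d.weekday()                # 0 = Monday ... 6 = Sunday
day_of_year = d.timetuple().tm_yday      # 1-based ordinal day
unix_date = (d - date(1970, 1, 1)).days  # whole days since the Unix epoch

utc = timezone.utc
plus_two = timezone(timedelta(hours=2))  # fixed offset, stand-in for a named zone

ts = datetime(2024, 1, 1, 12, 0, tzinfo=utc)

# convert_time_zone-style: same instant, different wall-clock reading
converted = ts.astimezone(plus_two)

# replace_time_zone-style: same wall-clock reading, different instant
replaced = ts.replace(tzinfo=plus_two)
```

Here `converted` still compares equal to `ts` because it names the same instant, while `replaced` keeps the 12:00 reading but now refers to a moment two hours earlier in UTC.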

Distance functions for vector inputs

cosine_distance(left, right) Compute the cosine distance between two embeddings.
dot_product(left, right) Compute the dot product between two embeddings.
euclidean_distance(left, right) Compute the Euclidean distance between two embeddings.
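
For reference, the three measures above follow standard definitions. The plain-Python sketch below uses the textbook formulas, with cosine distance taken as one minus cosine similarity; that definition, and returning `None` for zero vectors, are assumptions rather than documented Daft behavior.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; undefined (None here) for zero vectors."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if norm == 0:
        return None
    return 1.0 - dot_product(a, b) / norm

u = [1.0, 0.0]
v = [0.0, 1.0]
w = [3.0, 4.0]

orthogonal = cosine_distance(u, v)            # perpendicular vectors
identical = cosine_distance(w, w)             # same direction
length_of_w = euclidean_distance([0.0, 0.0], w)
```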

File Functions

audio_file(url[, verify, io_config]) Converts a string containing a file reference to a `daft.AudioFile` reference.
file(url[, io_config]) Converts a string containing a file reference to a `daft.File` reference.
file_path(file) Returns the path (URL) of the file as a string.
file_size(file) Returns the size of the file in bytes.
guess_mime_type(bytes_expr) Guess the MIME type of binary data by inspecting magic bytes.
image_file(url[, verify, io_config]) Converts a string containing a file reference to a `daft.ImageFile` reference.
video_file(url[, verify, io_config]) Converts a string containing a file reference to a `daft.VideoFile` reference.

Geospatial functions

great_circle_distance(lat1, lon1, lat2, lon2) Compute the great circle distance between two points on the Earth.
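
Great circle distance is commonly computed with the haversine formula. The sketch below is one such plain-Python version, offered as an illustration only: the input unit (degrees), the output unit (kilometers), and the Earth radius constant are all assumptions, since the entry above does not state which units or formula Daft uses.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the unit choice is an assumption

def great_circle_distance(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# A quarter of the way around the equator.
quarter = great_circle_distance(0.0, 0.0, 0.0, 90.0)
```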

Image File Functions

decode_image_file(file_expr[, mode, on_error]) Decode image files from a File column into an Image column.
image_file_metadata(file_expr) Extract image metadata (width, height, format, mode) from a File column.

Image Functions

convert_image(image, mode) Convert an image expression to the specified mode.
crop(image, bbox) Crops images with the provided bounding box.
decode_image(bytes[, on_error, mode]) Decodes the binary data in this column into images.
encode_image(image, image_format) Encode an image column as the provided image file format, returning a binary column of encoded bytes.
image_attribute(image, name) Get a property of the image, such as 'width', 'height', 'channel', or 'mode'.
image_channel(image) Gets the number of channels in an image.
image_hash(image[, method, hash_size, binbits, segments]) Compute a perceptual hash of an image column for near-duplicate detection.
image_height(image) Gets the height of an image in pixels.
image_mode(image) Gets the mode of an image.
image_to_tensor(image) Convert an image expression to a tensor, inferring dtype and shape.
image_width(image) Gets the width of an image in pixels.
resize(image, w, h) Resize image into the provided width and height.

LLM Functions

llm_generate(text[, model, provider, concurrency, batch_size, num_cpus, num_gpus], **generation_config) A UDF for running LLM inference over an input column of strings.

List Functions

chunk(list_expr, size) Splits each list into chunks of the given size.
explode(list_expr[, ignore_empty_and_null]) Explode a list expression.
list_append(list_expr, other) Appends a value to each list in the column.
list_bool_and(list_expr) Calculates the boolean AND of all values in a list.
list_bool_or(list_expr) Calculates the boolean OR of all values in a list.
list_contains(list_expr, item) Checks if each list contains the specified item.
list_count(list_expr[, mode]) Counts the number of elements in each list.
list_distinct(list_expr) Returns a list of unique elements in each list, preserving order of first occurrence and ignoring nulls.
list_filter(list_expr, predicate) Filters elements in a list using a boolean predicate expression.
list_flatten(list_expr) Flattens one level of nesting in each list.
list_join(list_expr, delimiter) Joins every element of a list using the specified string delimiter.
list_map(list_expr, mapper) Evaluates an expression on all elements in the list.
list_max(list_expr) Calculates the maximum of each list. If a list has no non-null values, the result is null.
list_mean(list_expr) Calculates the mean of each list. If a list has no non-null values, the result is null.
list_min(list_expr) Calculates the minimum of each list. If a list has no non-null values, the result is null.
list_sort(list_expr[, desc, nulls_first]) Sorts the inner lists of a list column.
list_sum(list_expr) Sums each list. Empty lists and lists with all nulls yield null.
seq(n) Generates a list of sequential integers [0, 1, 2, ..., n-1] for each row.
to_list(*items) Constructs a list from the item expressions.
value_counts(list_expr) Counts the occurrences of each distinct value in the list.
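
Several of the list functions above spell out their null and ordering rules. The plain-Python sketch below mirrors those stated rules on a single list, with `None` standing in for null; it is an illustration of the described semantics, not Daft code, and the helpers only borrow the Daft names.

```python
def list_distinct(values):
    """Unique elements in first-occurrence order, dropping None."""
    seen = set()
    out = []
    for v in values:
        if v is not None and v not in seen:
            seen.add(v)
            out.append(v)
    return out

def list_sum(values):
    """Sum of non-null elements; None for empty or all-null lists."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None

def list_max(values):
    """Maximum of non-null elements; None when there are none."""
    present = [v for v in values if v is not None]
    return max(present) if present else None

def seq(n):
    """Sequential integers 0, 1, ..., n-1."""
    return list(range(n))

distinct = list_distinct([3, None, 1, 3, 2, 1])
```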

Miscellaneous Functions

cast(expr, dtype) Casts an expression to the given datatype if possible.
coalesce(*args) Returns the first non-null value in a list of expressions. If all inputs are null, returns null.
concat(left, right) Concatenates two string or binary values.
eq_null_safe(left, right) Performs a null-safe equality comparison between two expressions.
fill_null(expr, fill_value) Fills null values in the Expression with the provided fill_value.
get(expr, key[, default]) Get an index from a list expression or a field from a struct expression.
hash(*exprs[, seed, hash_function]) Hashes the values in the Expression.
is_in(expr, other) Checks if values in the Expression are in the provided iterable.
is_null(expr) Checks if values in the Expression are Null (a special value indicating missing data).
length(expr) Retrieves the length of the given expression.
map_get(expr, key) Retrieves the value for a key in a map column.
map_keys(expr) Returns a list of all keys in the map.
minhash(text, num_hashes, ngram_size[, seed, hash_function]) Runs the MinHash algorithm on the series.
monotonically_increasing_id() Generates a column of monotonically increasing unique ids.
not_null(expr) Checks if values in the Expression are not Null (a special value indicating missing data).
random_int(low, high[, seed]) Generates a column of random integer values.
simhash(text[, ngram_size, hash_function]) Compute a SimHash fingerprint of the input text.
slice(expr, start[, end]) Get a subset of each list or binary value.
uuid() Generates a column of UUID strings.
when(condition, then) Start a conditional expression, similar to SQL CASE WHEN.
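
The null-handling helpers above (`coalesce`, `eq_null_safe`, `fill_null`) can be summarized per value as follows. This is a plain-Python sketch with `None` standing in for null. The `eq_null_safe` behavior shown is the conventional SQL null-safe equality, which is assumed here rather than confirmed by the entry above.

```python
def coalesce(*values):
    """First non-null argument, or None if every argument is null."""
    for v in values:
        if v is not None:
            return v
    return None

def eq_null_safe(left, right):
    """Equality that treats two nulls as equal and never returns null."""
    if left is None or right is None:
        return left is None and right is None
    return left == right

def fill_null(value, fill_value):
    """Replace a null with the provided fill value."""
    return fill_value if value is None else value

first = coalesce(None, None, "fallback", "ignored")
```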

Numeric Functions

abs(expr) The absolute value of a numeric expression.
arccos(expr) The elementwise arc cosine of a numeric expression.
arccosh(expr) The elementwise inverse hyperbolic cosine of a numeric expression.
arcsin(expr) The elementwise arc sine of a numeric expression.
arcsinh(expr) The elementwise inverse hyperbolic sine of a numeric expression.
arctan(expr) The elementwise arc tangent of a numeric expression.
arctan2(y, x) Calculates the four quadrant arctangent of coordinates (y, x), in radians.
arctanh(expr) The elementwise inverse hyperbolic tangent of a numeric expression.
between(expr, lower, upper) Checks if values in the Expression are between lower and upper, inclusive.
bin(expr) Returns the string representation of the binary value of an integer.
cbrt(expr) The cube root of a numeric expression.
ceil(expr) The ceiling of a numeric expression.
clip(expr[, min, max]) Clips an expression to the given minimum and maximum values.
cos(expr) The elementwise cosine of a numeric expression.
cosh(expr) The elementwise hyperbolic cosine of a numeric expression.
cot(expr) The elementwise cotangent of a numeric expression.
csc(expr) The elementwise cosecant of a numeric expression.
degrees(expr) Converts a numeric expression from radians to degrees, elementwise.
e() Returns Euler's number (e = 2.71828...).
exp(expr) The e^expr of a numeric expression.
expm1(expr) The e^expr - 1 of a numeric expression.
factorial(expr) Returns the factorial of a non-negative integer.
fill_nan(expr, fill_value) Fills NaN values in the Expression with the provided fill_value.
floor(expr) The floor of a numeric expression.
hypot(a, b) Returns sqrt(a^2 + b^2), the Euclidean norm.
is_inf(expr) Checks if values in the Expression are Infinity.
is_nan(expr) Checks if values are NaN (a special float value indicating not-a-number).
ln(expr) The elementwise natural log of a numeric expression.
log(expr[, base]) The elementwise log with given base, of a numeric expression.
log10(expr) The elementwise log base 10 of a numeric expression.
log1p(expr) The ln(expr + 1) of a numeric expression.
log2(expr) The elementwise log base 2 of a numeric expression.
negate(expr) The negative of a numeric expression.
not_nan(expr) Checks if values are not NaN (a special float value indicating not-a-number).
pi() Returns the mathematical constant pi (3.14159...).
pmod(a, b) Returns the positive modulo of `a` by `b`.
pow(base, expr) The base^expr of a numeric expression.
power(base, expr) The base^expr of a numeric expression.
radians(expr) Converts a numeric expression from degrees to radians, elementwise.
round(expr[, decimals]) Rounds a numeric expression to the specified number of decimal places.
sec(expr) The elementwise secant of a numeric expression.
sign(expr) The sign of a numeric expression.
sin(expr) The elementwise sine of a numeric expression.
sinh(expr) The elementwise hyperbolic sine of a numeric expression.
sqrt(expr) The square root of a numeric expression.
tan(expr) The elementwise tangent of a numeric expression.
tanh(expr) The elementwise hyperbolic tangent of a numeric expression.
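
A few of the numeric entries above have semantics that are easy to get subtly wrong, notably `pmod`, `clip`, and the inclusive bounds of `between`. The plain-Python sketch below shows one reading of each. It is not Daft's implementation; the `pmod` helper assumes a positive divisor.

```python
import math

def pmod(a, b):
    """Positive modulo: a non-negative remainder, assuming b > 0."""
    return ((a % b) + b) % b

def clip(x, lo=None, hi=None):
    """Clamp x to [lo, hi]; a bound of None leaves that side open."""
    if lo is not None and x < lo:
        return lo
    if hi is not None and x > hi:
        return hi
    return x

def between(x, lower, upper):
    """Inclusive range check on both ends."""
    return lower <= x <= upper

truncated = math.fmod(-7, 3)  # C-style remainder keeps the sign of the dividend
positive = pmod(-7, 3)        # positive modulo stays in [0, b)
norm = math.hypot(3, 4)       # sqrt(a^2 + b^2)
```

The `(a % b + b) % b` form is written so that it also carries over to languages whose `%` operator truncates toward zero.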

Partitioning Functions

partition_days(expr) Partitioning Transform that returns the number of days since epoch (1970-01-01).
partition_hours(expr) Partitioning Transform that returns the number of hours since epoch (1970-01-01).
partition_iceberg_bucket(expr, n) Partitioning Transform that returns the Hash Bucket following the Iceberg Specification of murmur3_32_x86.
partition_iceberg_truncate(expr, w) Partitioning Transform that truncates the input to a standard width `w` following the Iceberg Specification.
partition_months(expr) Partitioning Transform that returns the number of months since epoch (1970-01-01).
partition_years(expr) Partitioning Transform that returns the number of years since epoch (1970-01-01).
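
The time-based transforms above all count whole units elapsed since 1970-01-01. A plain-Python sketch of that arithmetic for dates on or after the epoch follows; it only borrows the Daft names, and behavior for pre-epoch values is deliberately left out.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)

def partition_years(d):
    """Whole years since 1970."""
    return d.year - 1970

def partition_months(d):
    """Whole months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)

def partition_days(d):
    """Whole days since 1970-01-01."""
    return (d - EPOCH_DATE).days

def partition_hours(ts):
    """Whole hours since 1970-01-01T00:00:00Z."""
    return int((ts - EPOCH_TS).total_seconds() // 3600)

d = date(2024, 3, 15)
ts = datetime(1970, 1, 2, 5, 30, tzinfo=timezone.utc)

years = partition_years(d)
months = partition_months(d)
hours = partition_hours(ts)
```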

Process Functions

run_process(args[, shell, on_error, return_dtype]) Returns an expression that runs an external process (optionally via a shell) and exposes its stdout as a column.

Similarity functions for vector inputs

cosine_similarity(left, right) Compute the cosine similarity between two embeddings.
hamming_distance(left, right) Compute the Hamming distance (number of differing bits) between two hash fingerprints.
jaccard_similarity(left, right) Compute the Jaccard similarity between two embeddings.
pearson_correlation(left, right) Compute the Pearson correlation between two embeddings.
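
As a reference for three of the measures above, here is a plain-Python sketch using the standard definitions. It treats hash fingerprints as non-negative integers, which is an assumption about representation, and it leaves out Jaccard similarity because the entry above does not say which variant applies to embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the norms; None for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else None

def pearson_correlation(a, b):
    """Cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def hamming_distance(left, right):
    """Number of differing bits between two non-negative integer fingerprints."""
    return bin(left ^ right).count("1")

bits_apart = hamming_distance(0b1011, 0b0010)
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
anti = pearson_correlation([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
```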

String Functions

capitalize(expr) Capitalize a UTF-8 string.
concat_ws(sep, *exprs) Concatenates strings with a separator, skipping null values.
contains(expr, substr) Checks whether each string contains the given substring in a string column.
count_matches(expr, patterns[, whole_words, case_sensitive]) Counts the number of times a pattern, or multiple patterns, appear in a string.
deserialize(expr, format, dtype) Deserializes a string using the specified format and data type.
endswith(expr, suffix) Checks whether each string ends with the given suffix in a string column.
find(expr, substr) Returns the index of the first occurrence of the substring in each string.
format(f_string, *args) Format a string using the given arguments.
hamming_distance_str(left, right) Compute the character-level Hamming distance between two strings.
ilike(expr, pattern) Checks whether each string matches the given SQL ILIKE pattern, case insensitive.
jq(expr, filter) Applies a [jq](https://jqlang.github.io/jq/manual/) filter to a string, returning the results as a string.
left(expr, nchars) Gets the `nchars` left-most characters of each string.
length_bytes(expr) Retrieves the length for a UTF-8 string column in bytes.
like(expr, pattern) Checks whether each string matches the given SQL LIKE pattern, case sensitive.
lower(expr) Convert a UTF-8 string to all lowercase.
lpad(expr, length, pad) Left-pads each string by truncating on the right or padding with the character.
lstrip(expr) Strip whitespace from the left side of a UTF-8 string.
normalize(expr[, remove_punct, lowercase, nfd_unicode, white_space]) Normalizes a string for more useful deduplication.
regexp(expr, pattern) Check whether each string matches the given regular expression pattern in a string column.
regexp_count(expr, pattern) Counts the number of times a regex pattern appears in a string.
regexp_extract(expr, pattern[, index]) Extracts the specified match group from the first regex match in each string in a string column.
regexp_extract_all(expr, pattern[, index]) Extracts the specified match group from all regex matches in each string in a string column.
regexp_replace(expr, pattern, replacement) Replaces all occurrences of a regex pattern in a string column with a replacement string.
regexp_split(expr, pattern) Splits each string on the given regex pattern, into a list of strings.
repeat(expr, n) Repeats each string n times.
replace(expr, search, replacement) Replaces all occurrences of a substring in a string with a replacement string.
reverse(expr) Reverse a UTF-8 string.
right(expr, nchars) Gets the `nchars` right-most characters of each string.
rpad(expr, length, pad) Right-pads each string by truncating or padding with the character.
rstrip(expr) Strip whitespace from the right side of a UTF-8 string.
serialize(expr, format) Serializes a value to a string using the specified format.
split(expr, split_on) Splits each string on the given string, into a list of strings.
startswith(expr, prefix) Checks whether each string starts with the given prefix in a string column.
strip(expr) Strip whitespace from both sides of a UTF-8 string.
substr(expr, start[, length]) Extract a substring from a string, starting at a specified index and extending for a given length.
to_camel_case(expr) Convert a string to lower camel case.
to_kebab_case(expr) Convert a string to kebab case.
to_snake_case(expr) Convert a string to snake case.
to_title_case(expr) Convert a string to title case.
to_upper_camel_case(expr) Convert a string to upper camel case.
to_upper_kebab_case(expr) Convert a string to upper kebab case.
to_upper_snake_case(expr) Convert a string to upper snake case.
tokenize_decode(expr, tokens_path[, io_config, pattern, special_tokens]) Decodes each list of integer tokens into a string using a tokenizer.
tokenize_encode(expr, tokens_path[, io_config, pattern, special_tokens, use_special_tokens]) Encodes each string as a list of integer tokens using a tokenizer.
try_deserialize(expr, format, dtype) Deserializes a string using the specified format and data type, inserting nulls on failures.
upper(expr) Convert a UTF-8 string to all uppercase.
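
The padding entries above mention both padding and truncation, which is worth seeing concretely. The following plain-Python sketch shows one reading of `lpad`, `rpad`, `left`, and `right` for a single string. It assumes `pad` is a single character and that strings longer than `length` keep their leading characters; it is an illustration, not Daft's implementation.

```python
def lpad(s, length, pad):
    """Pad on the left with `pad`, or keep only the first `length` characters."""
    if len(s) >= length:
        return s[:length]
    return pad * (length - len(s)) + s

def rpad(s, length, pad):
    """Pad on the right with `pad`, or keep only the first `length` characters."""
    if len(s) >= length:
        return s[:length]
    return s + pad * (length - len(s))

def left(s, nchars):
    """The nchars left-most characters."""
    return s[:nchars]

def right(s, nchars):
    """The nchars right-most characters."""
    return s[-nchars:] if nchars > 0 else ""

padded = lpad("7", 3, "0")
truncated = lpad("hello", 3, "*")
```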

Struct Functions

to_struct(*fields, **named_fields) Constructs a struct from the input expressions.
unnest(expr) Flatten the fields of a struct expression into columns in a DataFrame.

URL Functions

download(expr[, max_connections, on_error, io_config]) Treats each string as a URL, and downloads the bytes contents as a bytes column.
parse_url(expr) Parse string URLs and extract URL components.
upload(expr, location[, max_connections, on_error, io_config]) Uploads a column of binary data to the provided location(s) (supports S3, local paths, etc.).

Video Functions

video_frames(file_expr[, start_time, end_time, width, height, is_key_frame]) Decode all video frames within a time range, with per-frame metadata.
video_keyframes(file_expr[, start_time, end_time]) Get keyframes for a video file.
video_metadata(file_expr) Get metadata for a video file.

Window Functions

dense_rank() Return the dense rank of the current row (used for window functions).
lag(expr[, offset, default]) Get the value from a previous row within a window partition.
lead(expr[, offset, default]) Get the value from a future row within a window partition.
over(expr, window) Apply the expression as a window function.
rank() Return the rank of the current row (used for window functions).
row_number() Return the row number of the current row (used for window functions).
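
The ranking and offset functions above follow the familiar SQL window semantics. The sketch below reproduces them in plain Python over a single, already ordered partition. It is an illustration of those conventions, not Daft code; the default `offset=1` and non-negative offsets are assumptions.

```python
def row_number(values):
    """Sequential position, starting at 1, regardless of ties."""
    return list(range(1, len(values) + 1))

def rank(values):
    """Ties share a rank; the next distinct value skips ahead by the tie count."""
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks

def dense_rank(values):
    """Ties share a rank; the next distinct value gets the next integer."""
    ranks = []
    for i, v in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks

def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier, or `default` when out of range."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """Value from `offset` rows later, or `default` when out of range."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]

scores = [90, 85, 85, 70]  # one partition, already ordered
```

The tied scores make the difference visible: `rank` leaves a gap after the tie, while `dense_rank` does not.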