Migrating to Daft's new UDF API
Daft now offers a new UDF API via the @daft.func and @daft.cls decorators, replacing the legacy @daft.udf decorator. The new API is more powerful and Pythonic, and the legacy API will be deprecated in a future release.
This guide will walk you through the steps to migrate your existing UDFs to the new API.
Function UDF
Python functions decorated with @daft.udf can be easily converted into a batch function using the @daft.func.batch decorator.
Then, they can be used exactly the same as before.
Just like with legacy UDFs, inputs to batch functions are daft.Series objects, and the same return types are supported as well: daft.Series, list, numpy.ndarray, or pyarrow.Array.
Most decorator parameters also have equivalent parameters in the legacy UDF decorator. See the Decorator Parameters section for more details.
Class UDF
Creation
Python classes decorated with @daft.udf can be easily converted into a class UDF with a batch method using the @daft.cls and @daft.method.batch decorators.
Usage
Legacy class UDFs can be called directly as a function. With new class UDFs, you must first create an instance of the class and then call the method on that instance.
Instead of using .with_init_args(...) to specify the arguments to the __init__ method, pass those arguments when constructing the class instance.
Note
Whereas legacy class UDFs require the implementation of the __call__ method, new class UDFs allow you to implement other methods in the class and use them as UDFs.
For example, a class can define a generate method and use it as a UDF by calling that method on an instance of the class.
Decorator Parameters
The following parameters stay the same between the legacy and new APIs:
- return_dtype
- batch_size
- use_process
- ray_options
The following legacy parameters have changed. Here is what to use in the new API instead:
- concurrency: The new API offers a max_concurrency parameter instead, which guarantees that at most max_concurrency instances of the UDF will be running at any given time, instead of exactly concurrency instances.
- num_cpus: The new API offers a cpus parameter on @daft.func, @daft.func.batch, and @daft.cls with the same placement semantics. Fractional values (e.g. 0.5) are supported.
- num_gpus: The new API offers a gpus parameter on @daft.func, @daft.func.batch, and @daft.cls with the same placement semantics. Fractional values up to 1.0 are supported.
- memory_bytes: The new API currently has no replacement for memory_bytes. ray_options={"memory": ...} is explicitly rejected (see #6711). If you were using memory_bytes primarily to limit concurrency, use max_concurrency instead.
New parameters (no legacy equivalent)
The new API adds two parameters for controlling error handling that had no equivalent in @daft.udf:
- max_retries: Retry failing calls with exponential backoff (100 ms → 60 s, ±25% jitter). Also honors daft.ai.utils.RetryAfterError for rate-limit-aware retries.
- on_error: "raise" (default), "log", or "ignore". Controls behavior once retries are exhausted: "log" and "ignore" emit None for the failing row so the query keeps running.
Both are available on @daft.func, @daft.func.batch, and @daft.cls. @daft.method and @daft.method.batch accept these keyword arguments but currently ignore them, so set the class-level value on @daft.cls instead. See the Resources, Concurrency, and Error Handling section for details.
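To make the retry semantics concrete, here is a pure-Python sketch that does not use Daft and is not Daft's implementation. The 100 ms base, 60 s cap, and ±25% jitter come from the description above; the doubling factor, the point at which jitter is applied, and the helper names backoff_delay and call_with_retries are assumptions made for illustration.

```python
import random

BASE_DELAY_S = 0.1  # 100 ms, from the description above
MAX_DELAY_S = 60.0  # 60 s cap, from the description above
JITTER = 0.25       # +/-25%, from the description above
MULTIPLIER = 2.0    # assumption: the growth factor is not stated


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    nominal = min(BASE_DELAY_S * MULTIPLIER**attempt, MAX_DELAY_S)
    # Assumption: jitter is applied after capping the nominal delay.
    return nominal * rng.uniform(1 - JITTER, 1 + JITTER)


def call_with_retries(fn, arg, max_retries, on_error="raise", sleep=lambda s: None):
    """Call fn(arg), retrying up to max_retries times before giving up."""
    rng = random.Random(0)
    for attempt in range(max_retries + 1):
        try:
            return fn(arg)
        except Exception as exc:
            if attempt == max_retries:
                if on_error == "raise":
                    raise
                if on_error == "log":
                    print(f"giving up on {arg!r}: {exc}")
                # "log" and "ignore" both emit None for the failing row.
                return None
            # The default sleep is a no-op so the sketch runs instantly.
            sleep(backoff_delay(attempt, rng))
```

Passing sleep=time.sleep would make the helper actually wait between attempts.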
New Features
Here are some features that are available in the new API that may help simplify your code and improve performance during your migration.
See the main Function UDF and Class UDF pages for a detailed description of the new API.
Row-wise Functions
If you find that your UDF is simply iterating over the rows of the input data and computing a result for each row without any vectorized or batch operations, consider implementing it as a row-wise function using the @daft.func decorator.
Row-wise functions receive a single row of input data at a time, and return a single value for that row. Daft will automatically handle batching and conversion between Daft and Python types under the hood.
Example:
Return Type Inference
With row-wise functions, Daft will also automatically infer the return type of the function from its Python type annotations. For example, annotating a function with -> int tells Daft to infer the return dtype as daft.DataType.int64().
The return_dtype parameter is still supported, but it is not required. See the Type Conversions page for a mapping from Python types to Daft types.
Async Functions
The new UDF API supports async Python functions natively for both row-wise and batch functions. Simply specify the async keyword in front of the function definition and use it like a regular Daft function.
Daft will handle the asynchronous execution of the function under the hood, so if you are calling async functions from within your UDF, you can now just await them directly.
Example:
Known Limitations
- The new API does not yet expose a memory_bytes parameter, and ray_options={"memory": ...} is explicitly rejected (#6711). If you were using memory_bytes primarily to bound concurrency, prefer max_concurrency. If you need true memory-based placement on Ray, you'll need to stay on @daft.udf until this is resolved.
If you have any questions or feedback about the new UDF API, please submit an issue on GitHub or reach out to us on Slack.