Batch Inference#
Run prompts, embeddings, and model scoring over large datasets, then stream the results to durable storage. Daft is a reliable engine for expressing batch inference pipelines and scaling them from your laptop to a distributed cluster.
When to use Daft for batch inference#
- You need to run models over your data: Express inference on a column (e.g., prompt, embed_text, embed_image) and let Daft handle batching, concurrency, and backpressure.
- You have data consisting of large objects in cloud storage: Daft has record-setting performance when reading from and writing to S3, and provides flexible APIs for working with URLs and Files.
- You're working with multimodal data: Daft supports datatypes such as images and videos, and lets you define custom data sources, sinks, and functions over this data.
- You want end-to-end pipelines where data sizes expand and shrink: For example, downloading images from URLs, decoding them, then embedding them; Daft streams across stages to keep memory well-behaved.
If you’re new to Daft, see the quickstart first. For distributed execution, see our docs on Scaling Out and Deployment.
Core idea#
Daft provides first-class APIs for model inference. Under the hood, Daft optimizes for throughput by pipelining data operations so that reading, inference, and writing overlap automatically.
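To make the idea of overlapping stages concrete, here is a minimal standard-library sketch of a read, infer, write pipeline connected by bounded queues. It illustrates the general technique only and is not Daft's implementation; `run_pipeline`, `stage`, and `max_in_flight` are hypothetical names, and error handling is omitted.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def run_pipeline(rows, infer, max_in_flight=4):
    """Read rows, run infer on each, and collect results while all stages overlap."""
    # Bounded queues provide backpressure: a slow stage blocks the one before it
    # instead of letting unprocessed rows pile up in memory.
    to_infer = queue.Queue(maxsize=max_in_flight)
    to_write = queue.Queue(maxsize=max_in_flight)
    results = []

    def reader():
        for row in rows:
            to_infer.put(row)  # blocks while the queue is full
        to_infer.put(SENTINEL)

    def writer():
        while True:
            item = to_write.get()
            if item is SENTINEL:
                return
            results.append(item)  # stand-in for writing to durable storage

    threads = [
        threading.Thread(target=reader),
        threading.Thread(target=stage, args=(infer, to_infer, to_write)),
        threading.Thread(target=writer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each stage runs in its own thread, the reader can fetch the next row while the model is busy with the current one and the writer is persisting the previous result.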

Example: Prompt GPT-5 with OpenAI#
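A minimal sketch of the kind of pipeline this example describes, under several assumptions: that your Daft release exposes `prompt` in `daft.functions` with `provider` and `model` keyword arguments, that the Hugging Face dataset `fka/awesome-chatgpt-prompts` (chosen here for illustration) has a `prompt` column, that the model is addressed as `gpt-5`, and that `OPENAI_API_KEY` is set in the environment. The wrapper function `run_prompt_pipeline` is a hypothetical name.

```python
def run_prompt_pipeline(output_dir="output.parquet"):
    """Prompt a hosted model over a Hugging Face dataset and write the results to Parquet."""
    # Imported inside the function so the sketch can be loaded without Daft installed.
    import daft
    from daft.functions import prompt  # assumed location of the prompt() function

    # Assumed dataset; it is expected to contain a "prompt" column.
    df = daft.read_huggingface("fka/awesome-chatgpt-prompts")

    # Express inference as a new column; provider and model names are assumptions.
    df = df.with_column(
        "output",
        prompt(daft.col("prompt"), provider="openai", model="gpt-5"),
    )

    # Daft DataFrames are lazy: the write is what triggers execution.
    return df.write_parquet(output_dir)
```

Calling `run_prompt_pipeline()` runs the whole read, prompt, write sequence in one pass.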
What this does:
- Uses prompt() to express inference.
- Streams rows through OpenAI concurrently while reading from Hugging Face and writing to Parquet.
- Requires no explicit async, batching, rate limiting, or retry code in your script.
Example: Local text embedding with LM Studio#
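A minimal sketch of local text embedding, assuming that `embed_text` is available in `daft.functions`, that it accepts the provider name `"lm_studio"`, and that an LM Studio server is running locally with an embedding model loaded. The model identifier and the helper name `embed_locally` are placeholders for illustration; substitute whatever model you have loaded.

```python
def embed_locally(texts, model="text-embedding-nomic-embed-text-v1.5"):
    """Return a lazy DataFrame with an embedding column computed by a local model."""
    # Imported inside the function so the sketch can be loaded without Daft installed.
    import daft
    from daft.functions import embed_text  # assumed location of embed_text()

    df = daft.from_pydict({"text": list(texts)})

    # The provider string and model name are assumptions, not confirmed defaults.
    return df.with_column(
        "embedding",
        embed_text(daft.col("text"), provider="lm_studio", model=model),
    )
```

The returned DataFrame is lazy; calling `.show()` or `.collect()` on it triggers the embedding requests.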
Notes:
- LM Studio is a local AI model platform that lets you run large language models such as Qwen, Mistral, Gemma, or gpt-oss on your own machine. With Daft and LM Studio, you can run inference with any model locally and use accelerators such as Apple's Metal Performance Shaders (MPS).
Scaling out on Ray#
Turn on distributed execution with a single line; then run the same script on a Ray cluster.
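A minimal sketch of switching runners, assuming your Daft release exposes `set_runner_ray` at the top level (some releases expose it as `daft.context.set_runner_ray` instead) and that Ray is installed. The wrapper `use_ray` is a hypothetical convenience; in a script, the single call inside it is the only line needed.

```python
def use_ray(address=None):
    """Route subsequent Daft queries to Ray instead of the local runner."""
    # Imported inside the function so the sketch can be loaded without Daft installed.
    import daft

    # With address=None, Ray is expected to start or attach to a local instance;
    # pass a cluster address to target a remote cluster. Call this before running queries.
    daft.set_runner_ray(address=address)
```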
Daft partitions the data, schedules remote execution, and orchestrates your workload across the cluster. No pipeline rewrites.
Patterns that work well#
- Read → Preprocess → Infer → Write: Daft parallelizes and pipelines automatically to maximize throughput and resource utilization.
- Provider-agnostic pipelines: Switch between OpenAI and local LLMs by changing a single parameter.
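One way to keep a pipeline provider-agnostic is to isolate the provider choice in a small table of presets. This is a sketch under assumptions: `PRESETS`, `add_output`, and the local model name are hypothetical, and it assumes `prompt` in `daft.functions` accepts `provider` and `model` keyword arguments for both a hosted and a local provider.

```python
# Hypothetical presets: provider and model names are assumptions for illustration.
PRESETS = {
    "hosted": {"provider": "openai", "model": "gpt-5"},
    "local": {"provider": "lm_studio", "model": "my-local-model"},
}


def add_output(df, preset, source="prompt", target="output"):
    """Append a model-output column to df using the chosen preset."""
    # Imported inside the function so the sketch can be loaded without Daft installed.
    from daft import col
    from daft.functions import prompt  # assumed location of the prompt() function

    return df.with_column(target, prompt(col(source), **PRESETS[preset]))
```

With this shape, moving from `add_output(df, "hosted")` to `add_output(df, "local")` leaves the rest of the pipeline untouched.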
Case Studies#
For inspiration and real-world scale:
- Processing 24 trillion tokens with 0 crashes—How Essential AI built Essential-Web v1.0 with Daft
- Processing 300K Images Without OOMs
- Embedding millions of text documents with Qwen3, achieving near 100% GPU utilization
Next Steps#
Ready to explore Daft further? Check out these topics:
- AI functions
- Reading from and writing to common data sources:
  - S3
  - Hugging Face 🤗
  - Turbopuffer
- Scaling out and deployment