Batch Inference

Run prompts, embeddings, and model scoring over large datasets, then stream the results to durable storage. Daft is a reliable engine for expressing batch inference pipelines and scaling them from your laptop to a distributed cluster.

When to use Daft for batch inference

You need to run models over your data: Express inference on a column (e.g., prompt, embed_text, embed_image) and let Daft handle batching, concurrency, and backpressure.
You have data consisting of large objects in cloud storage: Daft has record-setting performance when reading from and writing to S3, and provides flexible APIs for working with URLs and Files.
You're working with multimodal data: Daft has datatypes for images and videos, and lets you define custom data sources, sinks, and functions over this data.
You want end-to-end pipelines where data sizes expand and shrink: For example, when downloading images from URLs, decoding them, and then embedding them, Daft streams data across stages to keep memory usage under control.

If you’re new to Daft, see the quickstart first. For distributed execution, see our docs on Scaling Out and Deployment.

Core idea

Daft provides first-class APIs for model inference. Under the hood, Daft pipelines data operations so that reading, inference, and writing overlap automatically, and it optimizes the pipeline for throughput.

[Figure: Daft pipeline visualization]
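
To make the overlap concrete, here is a toy, standard-library sketch of a three-stage pipeline (read, infer, write) connected by bounded queues. It is not Daft's implementation; the names `run_pipeline`, `stage`, and `max_in_flight` are invented for illustration, and error handling is omitted. The bounded queues are what supply backpressure: a fast reader blocks once the slower inference stage falls behind.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage we are done
            return
        outbox.put(fn(item))


def run_pipeline(rows, infer, max_in_flight=4):
    """Read rows, apply infer, and collect results, one thread per stage."""
    # Bounded queues: put() blocks when max_in_flight items are already waiting.
    to_infer = queue.Queue(maxsize=max_in_flight)
    to_write = queue.Queue(maxsize=max_in_flight)
    written = []  # stands in for durable storage

    def reader():
        for row in rows:
            to_infer.put(row)
        to_infer.put(SENTINEL)

    def writer():
        while True:
            item = to_write.get()
            if item is SENTINEL:
                return
            written.append(item)

    threads = [
        threading.Thread(target=reader),
        threading.Thread(target=stage, args=(infer, to_infer, to_write)),
        threading.Thread(target=writer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written


results = run_pipeline(["a", "bb", "ccc"], infer=len)
print(results)  # [1, 2, 3]
```

Because each stage runs in its own thread, reading the next row, scoring the current one, and writing the previous one can all happen at the same time.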

Example: Prompt GPT-5 with OpenAI

import daft
from daft.functions import prompt

(
    daft.read_huggingface("fka/awesome-chatgpt-prompts")
    .with_column( # Generate model outputs in a new column
        "output",
        prompt(
            daft.col("prompt"),
            model="gpt-5",           # Any chat/completions-capable model
            provider="openai",       # Switch providers by changing this; e.g. to "vllm"
            max_output_tokens=256,   # OpenAI Provider uses Responses API by default
        ),
    )
    .write_parquet("output.parquet/", write_mode="overwrite")  # Write to Parquet as the pipeline runs
)

What this does:

Uses prompt() to express inference.
Streams rows through OpenAI concurrently while reading from Hugging Face and writing to Parquet.
Requires no explicit async, batching, rate limiting, or retry code in your script.
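
For contrast, a hand-written version of the concurrency and retry logic that the script above avoids might look like the following. This is a generic `asyncio` sketch, not what Daft does internally; `fake_model_call`, `call_with_retry`, and `run_all` are hypothetical names, and batching and rate limiting are left out to keep it short.

```python
import asyncio


async def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for a network request to a model provider."""
    await asyncio.sleep(0.001)
    return prompt.upper()


async def call_with_retry(call, prompt, semaphore, max_attempts=3, base_delay=0.01):
    """Run call(prompt) under a concurrency cap, retrying failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with semaphore:  # cap the number of in-flight requests
                return await call(prompt)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff before the next attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def run_all(call, prompts, max_concurrency=8):
    """Send every prompt concurrently and return the outputs in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [call_with_retry(call, p, semaphore) for p in prompts]
    return await asyncio.gather(*tasks)


outputs = asyncio.run(run_all(fake_model_call, ["hello", "world"]))
print(outputs)  # ['HELLO', 'WORLD']
```

Even this reduced version has to thread a semaphore and a retry loop through every call, which is the bookkeeping that a single `prompt()` expression keeps out of your script.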

Example: Local text embedding with LM Studio

import daft
from daft.ai.provider import load_provider
from daft.functions.ai import embed_text

# Connect to LM Studio (assumes its local server is running with the model below loaded)
provider = load_provider("lm_studio")
model = "text-embedding-nomic-embed-text-v1.5"

(
    daft.read_huggingface("Open-Orca/OpenOrca")
    .with_column("embedding", embed_text(daft.col("response"), provider=provider, model=model))
    .show()
)

Notes:

LM Studio is a local AI model platform that lets you run large language models such as Qwen, Mistral, Gemma, or gpt-oss on your own machine. By using Daft with LM Studio, you can run inference with any model locally and use accelerators such as Apple's Metal Performance Shaders (MPS).

Scaling out on Ray

Turn on distributed execution with a single line; then run the same script on a Ray cluster.

import daft
daft.set_runner_ray()  # Enable Daft's distributed runner

Daft partitions the data, schedules remote execution, and orchestrates your workload across the cluster. No pipeline rewrites.
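
One way to keep a single script for both local and cluster runs is to choose the runner from the environment before building the pipeline. The sketch below is an assumption about how you might wire this up: `USE_RAY` is a hypothetical switch invented here (not a Daft setting), and passing `RAY_ADDRESS` through `address=` assumes you want to target a specific cluster.

```python
import os


def configure_runner() -> str:
    """Pick Daft's runner up front, before any DataFrame is built or executed."""
    if os.environ.get("USE_RAY") == "1":  # hypothetical switch for this sketch
        import daft  # imported only when the Ray runner is requested

        # address=None leaves cluster discovery to Ray's defaults;
        # set RAY_ADDRESS to point at a specific cluster instead.
        daft.set_runner_ray(address=os.environ.get("RAY_ADDRESS"))
        return "ray"
    return "local"


runner = configure_runner()
print(f"Running on: {runner}")
```

The rest of the script (reads, `prompt()` or `embed_text()` columns, writes) stays the same in both modes.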

Patterns that work well

Read → Preprocess → Infer → Write: Daft parallelizes and pipelines automatically to maximize throughput and resource utilization.
Provider-agnostic pipelines: Switch between OpenAI and local LLMs by changing a single parameter.
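
As a sketch of the provider-agnostic pattern, the settings passed to `prompt()` can live in a small lookup table so the pipeline code itself never changes. The table and `build_pipeline` are hypothetical names; `"openai"`, `"gpt-5"`, and `"vllm"` come from the prompt example on this page, while `"your-local-model"` is a placeholder you would replace. In practice the model name usually changes along with the provider.

```python
# Hypothetical per-target settings for prompt(); adjust to your environment.
PROMPT_SETTINGS = {
    "cloud": {"provider": "openai", "model": "gpt-5"},
    "local": {"provider": "vllm", "model": "your-local-model"},  # placeholder model name
}


def build_pipeline(target: str):
    """Return a lazy Daft DataFrame that prompts the provider chosen by target."""
    import daft
    from daft.functions import prompt

    settings = PROMPT_SETTINGS[target]  # raises KeyError for unknown targets
    return daft.read_huggingface("fka/awesome-chatgpt-prompts").with_column(
        "output",
        prompt(daft.col("prompt"), **settings),
    )


# Usage (needs Daft installed and the chosen provider reachable):
# build_pipeline("local").show()
```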

Case Studies

For inspiration and real-world scale:

Next Steps

Ready to explore Daft further? Check out these topics:

AI functions
Reading from and writing to common data sources:
S3
Hugging Face 🤗
Turbopuffer
Scaling out and deployment