Working with Images#

Daft is built to work comfortably with images. This guide shows you how to accomplish common image processing tasks with Daft:

Downloading and decoding images
Near-duplicate detection with perceptual hashing
Generate image embeddings
Classify images

It also explains how dynamic execution works for multimodal workloads, to help you build a mental model of the Daft engine.

Quickstart#

To set up this example, let's read a Parquet file of sample dog owners from a public S3 bucket, then use daft.col() with the df.with_column() method to create a new column, full_name, by concatenating the first_name and last_name columns. Next, let's create a dogs DataFrame from a Python dictionary and use df.join() to join it with the owners DataFrame:

import daft
from daft import col

# Read parquet file containing sample dog owners
df = daft.read_parquet("s3://daft-oss-public-data/tutorials/10-min/sample-data-dog-owners-partitioned.pq/**")

# Combine "first_name" and "last_name" to create new column "full_name"
df = df.with_column("full_name", col("first_name") + " " + col("last_name"))
df.select("full_name", "age", "country", "has_dog").show()

# Create dataframe of dogs
df_dogs = daft.from_pydict(
    {
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
        "full_name": [
            "Ernesto Evergreen",
            "James Jale",
            "Wolfgang Winter",
            "Shandra Shamas",
            "Zaya Zaphora",
        ],
        "dog_name": ["Ernie", "Jackie", "Wolfie", "Shaggie", "Zadie"],
    }
)

# Join owners with dogs, dropping some columns
df_family = df.join(df_dogs, on="full_name").exclude("first_name", "last_name", "DoB", "country", "age")
df_family.show()

Output
╭───────────────────┬─────────┬────────────────────────────────┬──────────╮
│ full_name         ┆ has_dog ┆ urls                           ┆ dog_name │
│ ---               ┆ ---     ┆ ---                            ┆ ---      │
│ Utf8              ┆ Boolean ┆ Utf8                           ┆ Utf8     │
╞═══════════════════╪═════════╪════════════════════════════════╪══════════╡
│ Wolfgang Winter   ┆ None    ┆ https://live.staticflickr.com… ┆ Wolfie   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ Shandra Shamas    ┆ true    ┆ https://live.staticflickr.com… ┆ Shaggie  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ Zaya Zaphora      ┆ true    ┆ https://live.staticflickr.com… ┆ Zadie    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ Ernesto Evergreen ┆ true    ┆ https://live.staticflickr.com… ┆ Ernie    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ James Jale        ┆ true    ┆ https://live.staticflickr.com… ┆ Jackie   │
╰───────────────────┴─────────┴────────────────────────────────┴──────────╯

(Showing first 5 of 5 rows)

You can use the download() expression to download the bytes from a URL. Let's store them in a new column using the df.with_column() method:

🐍 Python

df_family = df_family.with_column("image_bytes", col("urls").download(on_error="null"))
df_family.show()

Output
╭───────────────────┬─────────┬────────────────────────────────┬──────────┬────────────────────────────────╮
│ full_name         ┆ has_dog ┆ urls                           ┆ dog_name ┆ image_bytes                    │
│ ---               ┆ ---     ┆ ---                            ┆ ---      ┆ ---                            │
│ Utf8              ┆ Boolean ┆ Utf8                           ┆ Utf8     ┆ Binary                         │
╞═══════════════════╪═════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ Wolfgang Winter   ┆ None    ┆ https://live.staticflickr.com… ┆ Wolfie   ┆ b"\xff\xd8\xff\xe0\x00\x10JFI… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Shandra Shamas    ┆ true    ┆ https://live.staticflickr.com… ┆ Shaggie  ┆ b"\xff\xd8\xff\xe0\x00\x10JFI… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Zaya Zaphora      ┆ true    ┆ https://live.staticflickr.com… ┆ Zadie    ┆ b"\xff\xd8\xff\xe0\x00\x10JFI… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Ernesto Evergreen ┆ true    ┆ https://live.staticflickr.com… ┆ Ernie    ┆ b"\xff\xd8\xff\xe0\x00\x10JFI… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ James Jale        ┆ true    ┆ https://live.staticflickr.com… ┆ Jackie   ┆ b"\xff\xd8\xff\xe0\x00\x10JFI… │
╰───────────────────┴─────────┴────────────────────────────────┴──────────┴────────────────────────────────╯

(Showing first 5 of 5 rows)
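To make the effect of on_error="null" concrete, here is a minimal plain-Python sketch of the behaviour: a failed download leaves a None in that row instead of aborting the whole job. It does not use Daft; download_all and fake_fetch are hypothetical names invented for this illustration.

```python
from typing import Callable, List, Optional


def download_all(
    urls: List[str],
    fetch: Callable[[str], bytes],
    on_error: str = "raise",
) -> List[Optional[bytes]]:
    """Fetch each URL; with on_error="null", failures become None instead of raising."""
    results: List[Optional[bytes]] = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception:
            if on_error == "null":
                results.append(None)  # keep the row, mark the value as missing
            else:
                raise
    return results


def fake_fetch(url: str) -> bytes:
    # Hypothetical fetcher: pretends that any URL containing "missing" fails.
    if "missing" in url:
        raise IOError(f"could not fetch {url}")
    return b"\xff\xd8\xff" + url.encode()


rows = download_all(
    ["https://example.com/a.jpg", "https://example.com/missing.jpg"],
    fake_fetch,
    on_error="null",
)
print([r is None for r in rows])  # [False, True]
```

Because failed rows carry None downstream, any UDF applied afterwards should be prepared to receive None.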

Let's turn the bytes into human-readable images using decode_image():

🐍 Python

df_family = df_family.with_column("image", daft.col("image_bytes").decode_image())
df_family.show()
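The image_bytes values shown earlier all start with \xff\xd8\xff, which is the JPEG signature. If you want to sanity-check downloaded bytes before decoding, a small stdlib-only sketch like the one below can sniff the format from the leading bytes. It is independent of Daft, and sniff_format is a hypothetical helper that covers only two formats.

```python
from typing import Optional

# Leading byte signatures of two common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def sniff_format(data: Optional[bytes]) -> Optional[str]:
    """Return the format name if the bytes start with a known signature, else None."""
    if data is None:
        return None
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


print(sniff_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # JPEG
```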

End-to-End Image Pipeline#

This example demonstrates a complete pipeline: URL -> download -> decode -> resize -> to_tensor -> normalize.

This is a common preprocessing pipeline for preparing images for deep learning models (e.g., in PyTorch).

🐍 Python

import daft
from daft import col, DataType
import numpy as np

# 1. Create a DataFrame with image URLs
df = daft.from_pydict({
    "urls": [
        "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
        "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
        "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
        "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
        "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
    ],
})

# 2. Define a UDF for normalization (Standard ImageNet normalization)
@daft.func(return_dtype=DataType.tensor(DataType.float32()))
def normalize_image(img):
    if img is None:
        return None

    # Standard ImageNet normalization mean and std
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    # Convert to float32 and scale to [0, 1]
    # Input img is [H, W, C]
    img_float = img.astype(np.float32) / 255.0

    # Normalize
    # img_float is [H, W, C], mean/std are [3] broadcasting over the last dimension
    normalized = (img_float - mean) / std

    # Transpose to [C, H, W] for PyTorch models
    normalized = normalized.transpose(2, 0, 1)

    return normalized

# 3. Build the pipeline: URL -> download -> decode -> resize -> to_tensor -> normalize
df = df.with_column("image", col("urls").download(on_error="null").decode_image().resize(224, 224))
df = df.with_column("tensor", col("image").image_to_tensor())
df = df.with_column("normalized", normalize_image(col("tensor")))

df.collect()
df.select("urls", "image", "normalized").show()

Output
╭────────────────────────────────┬───────────────────────┬──────────────────────────────╮
│ urls                           ┆ image                 ┆ normalized                   │
│ ---                            ┆ ---                   ┆ ---                          │
│ String                         ┆ Image[RGB; 224 x 224] ┆ Tensor[Float32]              │
╞════════════════════════════════╪═══════════════════════╪══════════════════════════════╡
│ https://live.staticflickr.com… ┆ <FixedShapeImage>     ┆ <Tensor shape=(3, 224, 224)> │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ https://live.staticflickr.com… ┆ <FixedShapeImage>     ┆ <Tensor shape=(3, 224, 224)> │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ https://live.staticflickr.com… ┆ <FixedShapeImage>     ┆ <Tensor shape=(3, 224, 224)> │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ https://live.staticflickr.com… ┆ <FixedShapeImage>     ┆ <Tensor shape=(3, 224, 224)> │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ https://live.staticflickr.com… ┆ <FixedShapeImage>     ┆ <Tensor shape=(3, 224, 224)> │
╰────────────────────────────────┴───────────────────────┴──────────────────────────────╯

(Showing first 5 of 5 rows)
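The arithmetic inside normalize_image can be checked by hand on a single pixel. The sketch below uses plain Python lists instead of NumPy so that each step is visible; normalize_pixel and hwc_to_chw are hypothetical helpers for illustration, not part of Daft.

```python
# ImageNet channel statistics, as used in the pipeline above.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD))


def hwc_to_chw(image):
    """Reorder a nested-list image from [H][W][C] to [C][H][W]."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in white])  # [2.249, 2.429, 2.64]

tiny = [[[1, 2, 3], [4, 5, 6]]]  # one row, two pixels, three channels
print(hwc_to_chw(tiny))  # [[[1, 4]], [[2, 5]], [[3, 6]]]
```

A pure white pixel maps to roughly 2.2 to 2.6 per channel, so normalized values are not confined to [0, 1].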

UDF Best Practices for Images#

When processing images with User-Defined Functions (UDFs) in Daft, using libraries like Pillow, OpenCV, or torchvision efficiently is key to performance and robustness.

1. Handling `None` Values#

Daft data may contain None (null) values. Your UDF must handle these gracefully to avoid runtime errors.

🐍 Python

import daft
from PIL import Image
import io

@daft.func(return_dtype=daft.DataType.binary())
def process_image(image_bytes):
    # Always check for None!
    if image_bytes is None:
        return None

    try:
        img = Image.open(io.BytesIO(image_bytes))
        # ... processing ...

        # Serialize back to bytes for efficiency
        out = io.BytesIO()
        img.save(out, format=img.format or "PNG")
        return out.getvalue()
    except Exception:
        # Decide whether to return None or raise an error
        return None

2. Choosing the Right `return_dtype`#

The return_dtype argument in @daft.func or @daft.udf is crucial. It tells Daft what kind of data to expect, allowing for optimizations and correct schema inference.

daft.DataType.tensor(dtype): Best for returning numerical data (numpy arrays, torch tensors). This allows Daft to treat the column as a native tensor type, enabling further vectorized operations.
daft.DataType.binary(): Best for returning raw bytes (e.g. encoded PNG/JPEG data). This is often more memory efficient than full bitmaps, and avoids the pickling overhead associated with Python objects.
daft.DataType.python(): Use this if you are returning arbitrary Python objects (like PIL.Image objects) that don't map neatly to a Daft type. Note: Python objects cannot be serialized as efficiently and may block some downstream optimizations.
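To see why encoded bytes are often cheaper to carry around than decoded bitmaps, it helps to compute the size of an uncompressed image. The sketch below is plain arithmetic, independent of Daft; bitmap_bytes is a hypothetical helper.

```python
BYTES_PER_ELEMENT = {"uint8": 1, "float32": 4}


def bitmap_bytes(height: int, width: int, channels: int, dtype: str) -> int:
    """Size in bytes of a dense, uncompressed image array."""
    return height * width * channels * BYTES_PER_ELEMENT[dtype]


print(bitmap_bytes(224, 224, 3, "uint8"))    # 150528
print(bitmap_bytes(224, 224, 3, "float32"))  # 602112
```

A 224 x 224 RGB bitmap takes about 150 KB as uint8 and four times that as float32, whereas a JPEG of the same image is usually considerably smaller. This is one reason to keep data as binary until a step actually needs pixel values.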

3. Performance: `numpy` / `torch` vs `PIL.Image`#

Returning native arrays (NumPy or PyTorch) is generally more performant than returning Python objects like PIL.Image, especially when return_dtype is set to a Tensor type.

Why?#

Zero-copy / Low-overhead: Daft can often manage memory for Arrow/Tensor types more efficiently.
Serialization: PIL.Image objects are pickled/unpickled when moved between processes, which is slow. Tensors have efficient binary representations.

Example: Returning a Tensor (Recommended)#

🐍 Python

import io

import daft
import numpy as np
from PIL import Image

@daft.func(return_dtype=daft.DataType.tensor(daft.DataType.uint8()))
def image_to_numpy(image_bytes):
    if image_bytes is None:
        return None

    img = Image.open(io.BytesIO(image_bytes))
    # Convert to numpy array
    return np.array(img)

Example: Using torchvision#

When using torchvision, operations typically return torch.Tensor. You can return these directly if you specify a Tensor return type.

🐍 Python

import daft
import torch
import torchvision.transforms.functional as F
import numpy as np

@daft.func(return_dtype=daft.DataType.tensor(daft.DataType.float32()))
def transform_image(image_tensor):
    if image_tensor is None:
        return None

    # Assuming input is already a tensor or numpy array
    if isinstance(image_tensor, np.ndarray):
        image_tensor = torch.from_numpy(image_tensor)

    # Ensure channel-first format (C, H, W) for torchvision
    if image_tensor.ndim == 3 and image_tensor.shape[-1] == 3:
        image_tensor = image_tensor.permute(2, 0, 1)

    # Apply torchvision transforms using F
    image_tensor = F.resize(image_tensor, [224, 224])

    # Cast to float32 so the result matches the declared return_dtype
    return image_tensor.to(torch.float32)

4. Batch Processing with `@daft.func.batch`#

For even higher performance, especially with heavy libraries like OpenCV or PyTorch, consider using batched UDFs to process multiple rows at once, reducing Python function call overhead.

🐍 Python

import daft

@daft.func.batch(return_dtype=daft.DataType.tensor(daft.DataType.uint8()))
def batch_process_images(series):
    # 'series' is a Daft Series object
    # Convert to list of inputs
    inputs = series.to_pylist()

    results = []
    for item in inputs:
        if item is None:
            results.append(None)
            continue
        # Process item... (placeholder: pass the item through unchanged
        # so that the output has one entry per input row)
        results.append(item)

    return results
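A batched function has to return exactly one output per input row, with None kept in place. The following plain-Python sketch shows one way to honour that contract while calling the expensive operation only once per batch. It does not use Daft; apply_batch and invert_all are hypothetical names for illustration.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def apply_batch(
    inputs: List[Optional[T]],
    process: Callable[[List[T]], List[R]],
) -> List[Optional[R]]:
    """Run `process` once over the non-null inputs, then restore None at the original positions."""
    positions = [i for i, item in enumerate(inputs) if item is not None]
    processed = process([inputs[i] for i in positions])
    results: List[Optional[R]] = [None] * len(inputs)
    for i, value in zip(positions, processed):
        results[i] = value
    return results


def invert_all(pixels_batch):
    # Hypothetical stand-in for a real batched image operation.
    return [[255 - p for p in pixels] for pixels in pixels_batch]


out = apply_batch([[0, 128], None, [255]], invert_all)
print(out)  # [[255, 127], None, [0]]
```

Filtering out None before the batched call means a model or library that cannot accept missing values never sees them.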

Near-Duplicate Detection with Perceptual Hashing#

image_hash() computes a compact perceptual hash for each image. Two hashes with a low Hamming distance indicate visually similar images, making this a fast first-pass filter for near-duplicate detection at scale.

Algorithms#

Method	Description
`"phash"` (default)	Full 2D DCT perceptual hash — most robust to mild edits
`"phash_simple"`	Row-wise DCT only, compared to mean — faster variant of phash
`"dhash"`	Horizontal difference / gradient hash — fast and accurate
`"dhash_vertical"`	Vertical difference hash — compares top/bottom neighbours
`"ahash"`	Average hash — fastest, least robust
`"whash"`	Multi-level Haar wavelet hash (bit-exact with `imagehash.whash`)
`"colorhash"`	Color distribution hash in HSV space; use `binbits` to control precision
`"crop_resistant"`	Divides the image into a 3×3 grid and hashes each segment; robust against cropping at the cost of a larger (9×) hash
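To give a feel for what the simpler methods in the table compute, here is a stdlib-only sketch of an average hash and a horizontal difference hash over a grayscale grid that has already been resized. It is an illustration of the general technique only: the bit order and comparison direction are assumptions, so the output will not necessarily match the bytes produced by image_hash().

```python
from typing import List

Grid = List[List[int]]  # grayscale values, already resized by the caller


def bits_to_bytes(bits: List[bool]) -> bytes:
    """Pack bits, most significant first, into bytes (length must be a multiple of 8)."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | int(bit)
        out.append(byte)
    return bytes(out)


def average_hash(grid: Grid) -> bytes:
    """One bit per pixel: set when the pixel is brighter than the grid mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    return bits_to_bytes([p > mean for p in pixels])


def difference_hash(grid: Grid) -> bytes:
    """One bit per horizontal neighbour pair: set when the right pixel is brighter."""
    return bits_to_bytes(
        [row[x + 1] > row[x] for row in grid for x in range(len(row) - 1)]
    )


# An 8x8 left-to-right gradient for ahash, and an 8x9 one for dhash
# (dhash needs one extra column to form 8 neighbour pairs per row).
ramp8 = [[x * 32 for x in range(8)] for _ in range(8)]
ramp9 = [[x * 28 for x in range(9)] for _ in range(8)]
print(average_hash(ramp8).hex())     # 0f0f0f0f0f0f0f0f
print(difference_hash(ramp9).hex())  # ffffffffffffffff
```

Both functions produce 64 bits (8 bytes) for an 8 x 8 comparison grid, which matches the default hash size.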

Basic usage#

🐍 Python

import daft
from daft.functions import image_hash

df = daft.from_pydict({"urls": ["https://example.com/a.jpg", "https://example.com/b.jpg"]})
df = (
    df.with_column("image", daft.col("urls").download(on_error="null").decode_image())
      .with_column("hash", image_hash(daft.col("image")))  # default: phash, hash_size=8
)
df.select("urls", "hash").show()

The hash column has dtype FixedSizeBinary(8) — 64 bits per image for the default hash_size=8.

Finding near-duplicates#

Compare hashes within a dataset by joining the DataFrame with itself and computing the bitwise Hamming distance using the built-in hamming_distance:

🐍 Python

import daft
from daft.functions import image_hash, hamming_distance

df = daft.from_pydict({
    "id": [1, 2, 3],
    "image": [...],  # Image column
})
df = df.with_column("hash", image_hash(daft.col("image")))

# Self-join to find all pairs
left = df.select(daft.col("id").alias("id_a"), daft.col("hash").alias("hash_a"))
right = df.select(daft.col("id").alias("id_b"), daft.col("hash").alias("hash_b"))

pairs = (
    left.join(right, how="cross")
        .where(daft.col("id_a") < daft.col("id_b"))
        .with_column("dist", hamming_distance(daft.col("hash_a"), daft.col("hash_b")))
        .where(daft.col("dist") <= 10)  # threshold: ≤10 bits differ
)
pairs.show()
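The distance used in the filter above is simply the number of differing bits between two hashes. The stdlib-only sketch below reproduces the same pairwise comparison on three hand-written 8-byte hashes; hamming_distance_bytes is a hypothetical helper and not Daft's built-in function.

```python
from itertools import combinations


def hamming_distance_bytes(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


hashes = {
    1: bytes.fromhex("ffffffff00000000"),
    2: bytes.fromhex("ffffffff00000001"),  # 1 bit away from id 1
    3: bytes.fromhex("00000000ffffffff"),  # all 64 bits differ from id 1
}

near_duplicates = [
    (id_a, id_b, hamming_distance_bytes(hashes[id_a], hashes[id_b]))
    for id_a, id_b in combinations(sorted(hashes), 2)
    if hamming_distance_bytes(hashes[id_a], hashes[id_b]) <= 10
]
print(near_duplicates)  # [(1, 2, 1)]
```

Like the id_a < id_b filter in the self-join, combinations() visits each unordered pair once, so a pair is never reported twice or compared with itself.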

Crop-resistant hashing#

Use method="crop_resistant" when images may have been cropped or have different aspect ratios. The output hash is 9× larger (72 bytes for hash_size=8):

🐍 Python

df = df.with_column("hash_cr", image_hash(daft.col("image"), method="crop_resistant"))
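The two sizes quoted in this guide (8 bytes by default, 72 bytes for crop_resistant) are consistent with a hash of hash_size x hash_size bits per segment. The sketch below encodes that relationship as an assumption; it is only checked against those two stated values and may not hold for every method (colorhash, for example, is controlled by binbits). hash_num_bytes is a hypothetical helper.

```python
def hash_num_bytes(hash_size: int = 8, segments: int = 1) -> int:
    """Assumed hash length: hash_size^2 bits per segment, packed into bytes."""
    bits = hash_size * hash_size * segments
    return bits // 8


print(hash_num_bytes(8))              # 8
print(hash_num_bytes(8, segments=9))  # 72
```

When comparing crop-resistant hashes, any distance threshold should be scaled up accordingly, since there are nine times as many bits that can differ.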

Generate Image Embeddings#

Image embeddings convert images into numerical vectors that capture semantic meaning. Use them for tasks such as semantic search and similarity calculations.
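As a rough illustration of a similarity calculation, the stdlib-only sketch below ranks candidate vectors by cosine similarity to a query vector. The three-dimensional vectors are toy values chosen by hand; real image embeddings have many more dimensions, and cosine_similarity here is a hypothetical helper, not a Daft function.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
candidates = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0]}

ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)  # ['a', 'b']
```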

How to use the embed_image function#

By default, embed_image uses the Transformers provider, which requires the transformers optional dependency, together with OpenAI's CLIP model (openai/clip-vit-base-patch32).

pip install -U "daft[transformers]"

Once installed, we can run:

import daft
from daft.functions.ai import embed_image

(
    daft.read_huggingface("xai-org/RealworldQA")
    .with_column("image", daft.col("image")["bytes"].decode_image())
    .with_column("embedding", embed_image(daft.col("image")))
    .show()
)

Classify Images#

We'll define a UDF that uses a pre-trained PyTorch model, ResNet50, to classify the dog pictures. We'll pass in the contents of the urls column and write the classification predictions to a new column, classify_breed.

Working with PyTorch adds some complexity, but you can just run the cells below to perform the classification.

First, make sure to install and import some extra dependencies:

pip install validators matplotlib Pillow torch torchvision

🐍 Python

# import additional libraries, these are necessary for PyTorch
import torch

Define your ClassifyImages UDF. Models are expensive to initialize and load, so we want to do this as few times as possible, and share a model across multiple invocations.

🐍 Python

@daft.udf(return_dtype=daft.DataType.fixed_size_list(dtype=daft.DataType.string(), size=2))
class ClassifyImages:
    def __init__(self):
        # Perform expensive initializations - create and load the pre-trained model
        self.model = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_resnet50", pretrained=True)
        self.utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_convnets_processing_utils")
        self.model.eval().to(torch.device("cpu"))

    def __call__(self, images_urls):
        batch = torch.cat([self.utils.prepare_input_from_uri(uri) for uri in images_urls]).to(torch.device("cpu"))

        with torch.no_grad():
            output = torch.nn.functional.softmax(self.model(batch), dim=1)

        results = self.utils.pick_n_best(predictions=output, n=1)
        return [result[0] for result in results]

Now you're ready to call this function on the urls column and store the outputs in a new column we'll call classify_breed:

🐍 Python

classified_images_df = df_family.with_column("classify_breed", ClassifyImages(daft.col("urls")))
classified_images_df.select("dog_name", "image", "classify_breed").show()

Output
╭──────────┬──────────────┬────────────────────────────────╮
│ dog_name ┆ image        ┆ classify_breed                 │
│ ---      ┆ ---          ┆ ---                            │
│ Utf8     ┆ Image[MIXED] ┆ FixedSizeList[Utf8; 2]         │
╞══════════╪══════════════╪════════════════════════════════╡
│ Ernie    ┆ <Image>      ┆ [boxer, 52.3%]                 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Jackie   ┆ <Image>      ┆ [American Staffordshire terri… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Shaggie  ┆ <Image>      ┆ [standard schnauzer, 29.6%]    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Zadie    ┆ <Image>      ┆ [Rottweiler, 78.6%]            │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Wolfie   ┆ <Image>      ┆ [collie, 49.6%]                │
╰──────────┴──────────────┴────────────────────────────────╯

(Showing first 5 of 5 rows)

Note

Execute this in a notebook to see properly rendered images.
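Each classify_breed value above is a [label, percentage] pair. The stdlib-only sketch below shows how such a pair can be derived from raw model scores with a softmax followed by a top-1 pick. It is a simplified stand-in for what the model utilities do, with hypothetical labels and scores.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1 (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits: Sequence[float], labels: Sequence[str]) -> List[str]:
    """Return the most likely label and its probability formatted as a percentage."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [labels[best], f"{probs[best] * 100:.1f}%"]


labels = ["boxer", "collie", "Rottweiler"]  # hypothetical label set
print(top_prediction([2.0, 1.0, 0.1], labels))  # ['boxer', '65.9%']
```

Returning two strings per row is why the UDF declares a fixed-size list of length 2 as its return type.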

Zero Shot Classification#

For zero-shot classification, you can use the built-in classify_image function:

🐍 Python

classify_images_expr = daft.functions.classify_image(
  daft.col("image"), labels=[
    "boxer",
    "schnauzer",
    "rottweiler",
    "staffordshire terrier",
    "collie",
    "chihuahua",
    "corgi"
  ]
)
classified_images_df = df_family.with_column("classify_breed", classify_images_expr)
classified_images_df.select("dog_name", "image", "classify_breed").show()

Output
╭──────────┬──────────────┬───────────────────────╮
│ dog_name ┆ image        ┆ classify_breed        │
│ ---      ┆ ---          ┆ ---                   │
│ String   ┆ Image[MIXED] ┆ String                │
╞══════════╪══════════════╪═══════════════════════╡
│ Ernie    ┆ <Image>      ┆ boxer                 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Jackie   ┆ <Image>      ┆ staffordshire terrier │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Wolfie   ┆ <Image>      ┆ collie                │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Shaggie  ┆ <Image>      ┆ schnauzer             │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Zadie    ┆ <Image>      ┆ rottweiler            │
╰──────────┴──────────────┴───────────────────────╯

(Showing first 5 of 5 rows)

Dynamic Execution for Multimodal Workloads#

Daft uses dynamic execution to automatically adjust batch sizes based on the operation type and data characteristics.

This is necessary because multimodal data such as images, videos, and audio files have different memory and processing characteristics that can cause issues with fixed batching: large batches may exceed available memory, while small batches may not fully utilize hardware optimizations or network bandwidth.

How Batch Sizes Are Determined#

Multimodal Downloads: Downloads for multimodal data use smaller batch sizes (typically a factor of the max_connections parameter) to prevent memory exhaustion when downloading large files, while maintaining network throughput.

Vectorized Operations: Operations that can operate on many rows in parallel, such as byte decoding / encoding, aggregations, and scalar projections, will use larger batch sizes that can take advantage of vectorized execution using SIMD.

🐍 Python

import daft
from daft import col

# Each operation uses different batch sizes automatically
df = (
    daft.read_parquet("metadata.parquet")  # Large batches
    .with_column("image_data", col("urls").download())  # Small batches
    .with_column("image", col("image_data").decode_image())  # Decode to image
    .with_column("resized", col("image").resize(224, 224))  # Medium batches
)

This approach allows processing of datasets larger than available memory, while maintaining optimal performance for each operation type.
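To build intuition for why row size should influence batch size, here is a deliberately simple heuristic: fit as many rows as a memory budget allows, within fixed bounds. This is not Daft's actual algorithm; pick_batch_size, the 256 MiB budget, and the row-count cap are all hypothetical values chosen for illustration.

```python
def pick_batch_size(
    memory_budget_bytes: int,
    estimated_row_bytes: int,
    min_rows: int = 1,
    max_rows: int = 131072,
) -> int:
    """Choose how many rows fit in the budget, clamped to [min_rows, max_rows]."""
    if estimated_row_bytes <= 0:
        raise ValueError("estimated_row_bytes must be positive")
    rows = memory_budget_bytes // estimated_row_bytes
    return max(min_rows, min(max_rows, rows))


budget = 256 * 1024 * 1024  # hypothetical 256 MiB per batch

print(pick_batch_size(budget, 64))               # 131072 (small metadata rows hit the cap)
print(pick_batch_size(budget, 5 * 1024 * 1024))  # 51 (rows of about 5 MiB each)
```

Under this toy model, metadata rows are processed in very large batches while multi-megabyte images are processed a few dozen at a time, which mirrors the large-versus-small pattern described above.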
