Evaluating Image Understanding at Scale with Structured Outputs

An end-to-end example of Multimodal Structured Outputs with Daft and Qwen3-VL-8B

Introduction

We'll evaluate Qwen3-VL's image understanding using a multiple-choice subset of The Cauldron, HuggingFace's massive collection of 50 vision-language datasets.

Our pipeline will:

- Run structured output inference on multiple choice questions with images and text
- Conduct an ablation study (with vs. without images) to surface textual bias in image understanding
- Classify results into diagnostic quadrants
- Use VLM-as-a-Judge to explain failures

Check out the blog post where we evaluate Qwen3-VL-4B on 20k rows across 3 datasets.

About this Tutorial

This tutorial demonstrates the core evaluation pipeline on a small sample (50 rows) so you can inspect examples and understand the methodology. For an end-to-end implementation that scales to millions of rows, see eval_image_understanding.py in the daft-examples repo.

1. Setup

First, install the required dependencies:

pip install "daft[openai]" python-dotenv

Next, create a .env file in your project directory and add your HuggingFace token:

# .env
HF_TOKEN=your_huggingface_token_here

You can get a HuggingFace token from https://huggingface.co/settings/tokens.

Then, set up your environment variables and configuration:

import os
from dotenv import load_dotenv

load_dotenv()

# Configuration
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
LIMIT = 50  # Keep low for interactive demo

# HuggingFace Inference Provider (hosted Qwen3-VL endpoints)
OPENAI_API_KEY = os.getenv("HF_TOKEN")
OPENAI_BASE_URL = "https://router.huggingface.co/v1"
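Because `os.getenv` returns `None` when a variable is missing, a misconfigured `.env` file would otherwise only surface later, as an authentication error from the provider. A minimal sketch of a fail-fast check follows. `require_env` is a hypothetical helper introduced here for illustration; it is not part of Daft or python-dotenv.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a non-empty environment value or raise a descriptive error.

    Hypothetical helper for this tutorial; `env` defaults to os.environ
    and can be replaced with a plain dict for testing.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Usage, after load_dotenv() has run:
#   OPENAI_API_KEY = require_env("HF_TOKEN")
```

Failing at configuration time keeps the error close to its cause, rather than letting it appear in the middle of an inference run.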

Configure Daft to use the OpenAI-compatible provider:

import daft

# Set the OpenAI-compatible provider
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

2. Data Loading

The Cauldron is a massive collection of 50 vision-language datasets spanning:

- Visual question answering
- OCR & document understanding
- Chart/figure understanding
- Reasoning & math
- And more...

We'll start with the AI2D subset—science diagrams with multiple-choice questions.

df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d").limit(LIMIT).collect()
df_raw.show(3)

The dataset contains nested structures with the following columns:

- images: List of image bytes
- texts: List of conversation turns with user (question) and assistant (answer) fields
- Additional metadata fields

Each row represents a multiple-choice question with an accompanying science diagram.
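To make the nesting concrete, here is a plain-Python sketch of one row and of the flattening that the preprocessing step performs with `explode`. The row contents are hypothetical placeholders, and the `bytes` key is an assumption inferred from the `col("images")["bytes"]` access used later in this tutorial.

```python
# Hypothetical row mirroring the nested layout described above. Real rows
# hold encoded image bytes and may carry additional metadata fields.
row = {
    "images": [{"bytes": b"<encoded image bytes>"}],
    "texts": [
        {"user": "Which label marks the root?\nA. X\nB. Y", "assistant": "Answer: B"},
        {"user": "Which label marks the leaf?\nA. X\nB. Y", "assistant": "Answer: A"},
    ],
}


def flatten(row: dict) -> list[dict]:
    """Emit one flat record per (image, conversation turn) pair."""
    return [
        {
            "image_bytes": image["bytes"],
            "user": turn["user"],
            "assistant": turn["assistant"],
        }
        for image in row["images"]
        for turn in row["texts"]
    ]


flat_rows = flatten(row)
print(len(flat_rows))  # one record per question asked about the diagram
```

A single diagram can therefore yield several evaluation rows, one per question, which is why row counts after preprocessing may exceed the number of rows loaded.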

3. Preprocessing

We need to:

- Decode images into Daft's Image type
- Extract the question, choices, and correct answer from the text

from daft import col
from daft.functions import unnest

df_img = df_raw.explode(col("images"))
df_img = df_img.with_column("image", col("images")["bytes"].decode_image())

df_text = df_img.explode(col("texts")).select(unnest(col("texts")), "image")

df_prep = df_text.with_column(
    "answer",
    col("assistant").regexp_replace("Answer: ", "").lstrip().rstrip()
).collect()

df_prep.show(3)
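The answer extraction above can be checked on single strings in plain Python. This sketch assumes the assistant turn looks like `Answer: C` and that Daft's `lstrip`/`rstrip` trim whitespace the same way Python's `str.strip` does; `extract_answer` is a hypothetical name used only here.

```python
import re


def extract_answer(assistant_turn: str) -> str:
    """Drop the 'Answer: ' prefix and any surrounding whitespace."""
    return re.sub("Answer: ", "", assistant_turn).strip()


print(extract_answer("Answer: C"))       # C
print(extract_answer(" Answer: B \n"))   # B
```

Normalizing the reference answer this way matters because the later correctness check is an exact string comparison.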

4. Structured Outputs with `prompt`

Daft's prompt function scales OpenAI-compatible calls across dataframes. We'll use a Pydantic model to enforce structured output.

For more info: API docs | User Guide

from daft.functions import prompt
from pydantic import BaseModel, Field
import time

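# Deterministic decoding. max_tokens is kept tiny because the answer is a single letter;
# raise it if your provider truncates the JSON-wrapped structured output.
PARAMS = {"temperature": 0.0, "max_tokens": 2}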

class ChoiceResponse(BaseModel):
    """Structured output for multiple choice answers."""
    choice: str = Field(..., description="The letter of the correct choice (e.g., A, B, C, D)")

start = time.time()
df_results = df_prep.with_column(
    "result",
    prompt(
        messages=[col("image"), col("user")],
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
        **PARAMS,
    )
).limit(LIMIT).collect()
elapsed = time.time() - start

print(f"Processed {df_results.count_rows()} rows in {elapsed:.1f} seconds")

df_eval = df_results.with_column(
    "is_correct",
    col("result")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
)

accuracy = df_eval.where(col("is_correct")).count_rows() / df_eval.count_rows()
print(f"Accuracy (with image): {accuracy:.1%}")

df_eval.select("user", "image", "answer", col("result")["choice"].alias("predicted"), "is_correct").show(5)
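The scoring logic above reduces to two steps: pull the `choice` field out of a JSON-shaped response, then compare it to the reference letter. The following standard-library sketch reproduces those steps on hypothetical responses. `parse_choice` and `accuracy` are illustrative names, and the assumption that the raw response is a JSON object with a single `choice` key follows from the `ChoiceResponse` schema rather than from any documented wire format.

```python
import json


def parse_choice(raw: str) -> str:
    """Validate a payload shaped like {"choice": "B"} and return the letter."""
    payload = json.loads(raw)
    choice = payload["choice"]
    if not isinstance(choice, str):
        raise ValueError("choice must be a string")
    return choice.strip()


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference letters."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical raw responses and reference answers
raw_responses = ['{"choice": "A"}', '{"choice": " C"}', '{"choice": "B"}', '{"choice": "D"}']
answers = ["A", "C", "D", "D"]

predictions = [parse_choice(r) for r in raw_responses]
print(f"{accuracy(predictions, answers):.1%}")  # 75.0%
```

Note that the comparison is case-sensitive: a response of `b` would count as incorrect against a reference of `B`, so any normalization beyond whitespace trimming has to be added explicitly.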

5. Ablation Study

A simple accuracy score tells us how often the model is correct, but not why. Our full evaluation found that ~70% of correct answers on image understanding benchmarks don't actually require the image. To understand the true contribution of image understanding, we conduct an ablation study—running the same prompts without images.

This lets us classify each example into four quadrants:

Quadrant	With Image	Without Image	Interpretation
Both Correct	✓	✓	Question may be solvable from text alone
Image Helped	✓	✗	True image understanding
Image Hurt	✗	✓	Visual confusion
Both Incorrect	✗	✗	Hard question or model limitation
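The table is a two-input truth table, which can be written as a small pure-Python function. This is a sketch for checking the mapping by hand; `classify_quadrant` is a hypothetical name, and the labels are copied from the table above.

```python
def classify_quadrant(correct_with_image: bool, correct_without_image: bool) -> str:
    """Map the two correctness flags onto one of the four diagnostic quadrants."""
    if correct_with_image and correct_without_image:
        return "Both Correct"
    if correct_with_image:
        return "Image Helped"
    if correct_without_image:
        return "Image Hurt"
    return "Both Incorrect"


for with_image in (True, False):
    for without_image in (True, False):
        print(with_image, without_image, classify_quadrant(with_image, without_image))
```

Only the "Image Helped" quadrant is direct evidence that the model used the image; a row in "Both Correct" is compatible with image understanding but does not demonstrate it.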

SYSTEM_PROMPT_NO_IMAGE = "Respond to the multiple choice question with just the letter corresponding to the correct answer."

start = time.time()
df_ablation = df_eval.with_column(
    "result_no_image",
    prompt(
        messages=col("user"),
        system_message=SYSTEM_PROMPT_NO_IMAGE,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
        **PARAMS,
    )
).with_column(
    "is_correct_no_image",
    col("result_no_image")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
).collect()
elapsed = time.time() - start

print(f"Processed {df_ablation.count_rows()} rows in {elapsed:.1f} seconds")

accuracy_no_image = df_ablation.where(col("is_correct_no_image")).count_rows() / df_ablation.count_rows()

print(f"Accuracy with image:    {accuracy:.1%}")
print(f"Accuracy without image: {accuracy_no_image:.1%}")
print(f"Delta:                  {accuracy - accuracy_no_image:+.1%}")

from daft.functions import when, monotonically_increasing_id

df_classified = df_ablation.with_column(
    "id", monotonically_increasing_id()
).with_column(
    "quadrant",
    when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
    .when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
    .when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
    .otherwise("Both Incorrect")
)

df_classified.groupby("quadrant").count().select("quadrant", col("id").alias("count")).show()

Inspect cases where the image helped:

df_classified.where(col("quadrant") == "Image Helped").select(
    "user", "image", "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image")
).show(3)

Inspect cases where the image hurt:

df_classified.where(col("quadrant") == "Image Hurt").select(
    "user", "image", "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image")
).show(3)

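total_count = df_classified.count_rows()

# Count rows per quadrant and express each count as a percentage of all rows.
# A separate name avoids overwriting the inference results held in df_results.
df_quadrants = df_classified.groupby("quadrant").count().select(
    "quadrant",
    col("id").alias("count")
).with_column(
    "percentage",
    (col("count") / daft.lit(total_count) * 100)
).collect()

df_quadrants.show()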
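The percentage arithmetic can be verified on a small hand-made sample. The sketch below uses hypothetical labels for ten rows. It also computes the share of correct answers that did not need the image, which is one plausible reading of the "~70%" figure quoted earlier; the formula is an assumption, not necessarily the one used in the full evaluation.

```python
from collections import Counter

# Hypothetical quadrant labels for ten evaluated rows
labels = (
    ["Both Correct"] * 6
    + ["Image Helped"] * 2
    + ["Image Hurt"]
    + ["Both Incorrect"]
)


def quadrant_shares(labels: list[str]) -> dict[str, float]:
    """Percentage of rows that fall in each quadrant."""
    total = len(labels)
    if total == 0:
        return {}
    return {quadrant: 100 * n / total for quadrant, n in Counter(labels).items()}


def text_solvable_share(labels: list[str]) -> float:
    """Among rows answered correctly with the image, the fraction that were
    also answered correctly without it (assumed definition)."""
    counts = Counter(labels)
    correct_with_image = counts["Both Correct"] + counts["Image Helped"]
    if correct_with_image == 0:
        return 0.0
    return counts["Both Correct"] / correct_with_image


print(quadrant_shares(labels))
print(f"{text_solvable_share(labels):.0%}")  # 75%
```

With only 50 rows, each row moves a quadrant's share by two percentage points, so small differences between quadrants should be read with caution.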

6. VLM-as-a-Judge

We can go beyond pass/fail metrics by using VLM-as-a-Judge to explain why the model failed, especially on the most informative failure subsets:

- Image Hurt: correct without the image, incorrect with the image
- Both Incorrect: incorrect with and without the image

We'll use a structured output schema so the judge reliably returns fields we can analyze.

from daft.functions import format

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing the results of a textbook academic questions multiple choice benchmark.
Inspect the attached image and provide high-signal feedback on why the model chose its answer.
First, reason about the model's answer with the image and the model's answer without the image.
Second, develop a hypothesis for why the model made the choice it did.
Third, attribute the failure to a 'question' issue or an 'image' understanding issue.
Finally, assign whether the model's answer with the image is correct and whether the model's answer without the image is correct.
"""


class JudgeResponse(BaseModel):
    """Structured diagnostic feedback from the VLM judge."""

    reasoning: str = Field(..., description="Why did the model choose the answer it did?")
    hypothesis: str = Field(..., description="What caused the divergence from the correct answer?")
    attribution: str = Field(
        ...,
        description="Was this a 'question' issue or an 'image' understanding issue or 'other'?",
    )

judge_template = format(
    """Given the image attached and the multiple choice question of <question>{}</question>,
The model chose the following prediction <model_answer>{}</model_answer> and without the image, the model chose the following prediction <no_image_model_answer>{}</no_image_model_answer>, but the correct answer is <correct_answer>{}</correct_answer>.

Provide diagnostic feedback.
""",
    col("user"),
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("answer"),
)

df_failures = df_classified.where(
    (col("quadrant") == "Image Hurt") | (col("quadrant") == "Both Incorrect")
)

JUDGE_PARAMS = {"temperature": 0.0, "max_tokens": 512}

df_judged = df_failures.with_column(
    "judge_response",
    prompt(
        messages=[col("image"), judge_template],
        system_message=JUDGE_SYSTEM_PROMPT,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=JudgeResponse,
        **JUDGE_PARAMS,
    ),
).collect()

print(f"Judged {df_judged.count_rows()} failure rows")
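To see what the judge receives for a single row, the template can be rendered in plain Python with `str.format`. The sketch below uses a condensed stand-in for the judge template (same four positional placeholders, in the same order) and a hypothetical failure row; it does not call Daft's `format` function.

```python
# Condensed stand-in for the judge template: four positional placeholders,
# filled in the order question, with-image answer, no-image answer, truth.
TEMPLATE = (
    "Question: <question>{}</question>\n"
    "With image: <model_answer>{}</model_answer>\n"
    "Without image: <no_image_model_answer>{}</no_image_model_answer>\n"
    "Correct: <correct_answer>{}</correct_answer>"
)

# Hypothetical row from the "Image Hurt" quadrant
row = {
    "user": "Which label marks the stem?\nA. X\nB. Y",
    "with_image": "A",
    "without_image": "B",
    "answer": "B",
}

rendered = TEMPLATE.format(
    row["user"], row["with_image"], row["without_image"], row["answer"]
)
print(rendered)
```

Because the placeholders are positional, swapping two arguments would silently mislabel the with-image and without-image answers, so it is worth printing one rendered prompt before judging a full batch.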

The judge's attribution field helps separate question issues (ambiguous prompts) from image understanding issues (missed labels, visual ambiguity).

df_judged.select(
    "quadrant",
    "user",
    "image",
    "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image"),
    unnest(col("judge_response")),
).show(3)
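Since `attribution` is declared as a free-form string, the judge may phrase the same label in different ways. A minimal sketch of normalizing and tallying attributions follows; the matching rule and the sample outputs are assumptions made for illustration, and `normalize_attribution` is a hypothetical helper.

```python
from collections import Counter


def normalize_attribution(text: str) -> str:
    """Map a free-text attribution onto 'question', 'image', or 'other'."""
    lowered = text.lower()
    mentioned = [label for label in ("question", "image") if label in lowered]
    # Exactly one label mentioned: trust it. None or both: treat as ambiguous.
    return mentioned[0] if len(mentioned) == 1 else "other"


# Hypothetical judge outputs; real responses will vary in wording
raw_attributions = [
    "image",
    "Image understanding issue",
    "'question' issue",
    "Both the question and the image",
    "other",
]

tally = Counter(normalize_attribution(a) for a in raw_attributions)
print(dict(tally))
```

An alternative design is to constrain the schema itself, for example by typing the field as a fixed set of allowed values, so that no post-hoc normalization is needed.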

Verify the full pipeline ran:

print(f"Accuracy (with image):    {accuracy:.1%}")
print(f"Accuracy (without image): {accuracy_no_image:.1%}")
print(f"Delta:                   {accuracy - accuracy_no_image:+.1%}")

df_classified.groupby("quadrant").count().show()

print(f"Judge rows: {df_judged.count_rows()}")

7. Scale with Daft Cloud

This tutorial runs locally on 50 rows. The Cauldron contains millions of rows across 50 subsets. To run this evaluation at scale, use Daft Cloud.

The production-ready script eval_image_understanding.py includes:

- Multi-dataset evaluation across all Cauldron subsets
- Configurable batch processing
- Result aggregation and export

👉 Sign up for early access | Book a demo

8. Conclusion

In this tutorial, we built a small pipeline to evaluate Qwen3-VL's image understanding:

- Structured Outputs: Used Pydantic models to enforce consistent responses
- Ablation Study: Isolated image understanding from general reasoning
- Quadrant Analysis: Classified results into actionable categories
- VLM-as-a-Judge: Diagnosed failures on the most informative subsets ("Image Hurt" + "Both Incorrect")

Next Steps

Multi-Dataset Evaluation: Try the full pipeline in the daft-examples repository, which supports evaluation across all 50 Cauldron subsets.

Experiment Tracking: Wire judge feedback into MLflow or W&B to track improvements over time.

RLVR Training: Use the is_correct signal and judge attributions for reinforcement learning with verifiable rewards.

Resources

Canonical References:

- Getting Structured LLM Output (DeepLearning.ai)
- Judging LLM-as-a-Judge (NeurIPS 2023)