An end-to-end example of Multimodal Structured Outputs with Daft and Qwen3-VL-8B
Introduction
We'll evaluate Qwen3-VL's image understanding using a multiple choice subset of HuggingFace's The Cauldron dataset, a massive collection of 50 vision-language datasets.
Our pipeline will:
- Run structured output inference on multiple choice questions with images and text
- Conduct an ablation study (with vs. without images) to surface textual bias in image understanding
- Classify results into diagnostic quadrants
- Use VLM-as-a-Judge to explain failures
Check out the blog post where we evaluate Qwen3-VL-4B on 20k rows across 3 datasets.
About this Tutorial
This tutorial demonstrates the core evaluation pipeline on a small sample (50 rows) so you can inspect examples and understand the methodology. For an end-to-end implementation that scales to millions of rows, see eval_image_understanding.py in the daft-examples repo.
Table of Contents
- Setup
- Data Loading
- Preprocessing
- Structured Outputs with prompt
- Ablation Study
- VLM-as-a-Judge
- Scale with Daft Cloud
- Conclusion
1. Setup
First, install the required dependencies:
| pip install "daft[openai]" python-dotenv
|
Next, create a .env file in your project directory and add your HuggingFace token:
| # .env
HF_TOKEN=your_huggingface_token_here
|
You can get a HuggingFace token from https://huggingface.co/settings/tokens.
Then, set up your environment variables and configuration:
| import os
from dotenv import load_dotenv
load_dotenv()
# Configuration
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
LIMIT = 50 # Keep low for interactive demo
# HuggingFace Inference Provider (hosted Qwen3-VL endpoints)
OPENAI_API_KEY = os.getenv("HF_TOKEN")
OPENAI_BASE_URL = "https://router.huggingface.co/v1"
|
Configure Daft to use the OpenAI-compatible provider:
| import daft
# Set the OpenAI-compatible provider
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)
|
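If HF_TOKEN is missing from the environment, os.getenv returns None and the problem may not surface until the first model request. A minimal sketch of a fail-fast check using only the standard library; the helper name require_env is hypothetical and is not part of Daft or python-dotenv:

```python
import os


def require_env(name, env=os.environ):
    """Return the stripped value of `name`, or raise if it is unset or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file before running the pipeline."
        )
    return value


# Example with an explicit mapping instead of the real environment:
token = require_env("HF_TOKEN", {"HF_TOKEN": " hf_example_token "})
print(token)
```

In the setup above, this would replace the bare os.getenv("HF_TOKEN") call, so a missing token stops the script before any rows are processed.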
2. Data Loading
The Cauldron is a massive collection of 50 vision-language datasets spanning:
- Visual question answering
- OCR & document understanding
- Chart/figure understanding
- Reasoning & math
- And more...
We'll start with the AI2D subset—science diagrams with multiple-choice questions.
| df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d").limit(LIMIT).collect()
df_raw.show(3)
|
The dataset contains nested structures with columns:
- images: List of image bytes
- texts: List of conversation turns with user (question) and assistant (answer) fields
- Additional metadata fields
Each row represents a multiple-choice question with an accompanying science diagram.
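To make the nesting concrete, here is a sketch of what one row might look like as plain Python data. The values are placeholders rather than real dataset content, and the assumption that each images entry is a struct with a bytes field is inferred from how the column is accessed during preprocessing:

```python
# Hypothetical row: placeholder values, not real dataset content.
example_row = {
    "images": [{"bytes": b"<raw image bytes>"}],
    "texts": [
        {
            "user": "Question: Which label marks the root?\nChoices:\nA. X\nB. Y",
            "assistant": "Answer: B",
        }
    ],
}


def describe_row(row):
    """Summarize how many images and conversation turns a row carries."""
    first_turn = row["texts"][0]
    return {
        "num_images": len(row["images"]),
        "num_turns": len(row["texts"]),
        "first_question": first_turn["user"].splitlines()[0],
        "first_answer": first_turn["assistant"],
    }


summary = describe_row(example_row)
print(summary)
```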
3. Preprocessing
We need to:
- Decode images into Daft's Image type
- Extract the question, choices, and correct answer from the text
| from daft import col
from daft.functions import unnest

# One row per image, decoded from raw bytes into Daft's Image type
df_img = df_raw.explode(col("images"))
df_img = df_img.with_column("image", col("images")["bytes"].decode_image())

# One row per conversation turn; unnest exposes the user/assistant fields as columns
df_text = df_img.explode(col("texts")).select(unnest(col("texts")), "image")

# Strip the "Answer: " prefix so only the choice letter remains
df_prep = df_text.with_column(
    "answer",
    col("assistant").regexp_replace("Answer: ", "").lstrip().rstrip()
).collect()
df_prep.show(3)
|
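The same preprocessing can be sketched on plain Python data to show what the two explodes and the answer cleanup do to a single row. This is an illustration under the assumption that assistant turns look like "Answer: B"; flatten_row and clean_answer are hypothetical names, not Daft functions:

```python
from itertools import product


def clean_answer(assistant_text):
    """Drop the 'Answer: ' prefix and surrounding whitespace."""
    return assistant_text.replace("Answer: ", "").strip()


def flatten_row(row):
    """Pair every image with every conversation turn (two successive explodes)."""
    return [
        {
            "image_bytes": image["bytes"],
            "user": turn["user"],
            "answer": clean_answer(turn["assistant"]),
        }
        for image, turn in product(row["images"], row["texts"])
    ]


# Hypothetical row with two images and two turns -> four flattened rows.
rows = flatten_row(
    {
        "images": [{"bytes": b"img-1"}, {"bytes": b"img-2"}],
        "texts": [
            {"user": "q1", "assistant": "Answer: A"},
            {"user": "q2", "assistant": " Answer: C \n"},
        ],
    }
)
print(len(rows), [r["answer"] for r in rows])
```

Note that a row with several images and several turns multiplies out, so every question is paired with every image in that row.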
4. Structured Outputs with prompt
Daft's prompt function scales OpenAI-compatible calls across dataframes. We'll use a Pydantic model to enforce structured output.
For more info: API docs | User Guide
| from daft.functions import prompt
from pydantic import BaseModel, Field
import time

PARAMS = {"temperature": 0.0, "max_tokens": 2}

class ChoiceResponse(BaseModel):
    """Structured output for multiple choice answers."""
    choice: str = Field(..., description="The letter of the correct choice (e.g., A, B, C, D)")

start = time.time()
df_results = df_prep.with_column(
    "result",
    prompt(
        messages=[col("image"), col("user")],
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
        **PARAMS,
    )
).limit(LIMIT).collect()
elapsed = time.time() - start
print(f"Processed {df_results.count_rows()} rows in {elapsed:.1f} seconds")
|
| # Compare the predicted letter with the ground-truth letter, ignoring surrounding whitespace
df_eval = df_results.with_column(
    "is_correct",
    col("result")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
)
accuracy = df_eval.where(col("is_correct")).count_rows() / df_eval.count_rows()
print(f"Accuracy (with image): {accuracy:.1%}")
df_eval.select("user", "image", "answer", col("result")["choice"].alias("predicted"), "is_correct").show(5)
|
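The comparison above is an exact string match after trimming whitespace, so a response such as "b." or "(B)" would be scored as wrong. The following is an optional, more lenient scoring sketch in plain Python; it is not part of the original pipeline, and normalize_choice and score_accuracy are hypothetical helpers:

```python
def normalize_choice(raw):
    """Return a single uppercase letter, or '' if the text is not a bare choice."""
    cleaned = raw.strip().strip("().:").strip()
    if len(cleaned) == 1 and cleaned.isalpha():
        return cleaned.upper()
    return ""


def score_accuracy(predictions, answers):
    """Fraction of predictions whose normalized letter matches the answer."""
    pairs = list(zip(predictions, answers))
    if not pairs:
        return 0.0
    correct = sum(
        1
        for predicted, expected in pairs
        if normalize_choice(predicted)
        and normalize_choice(predicted) == normalize_choice(expected)
    )
    return correct / len(pairs)


# Hypothetical predictions and ground-truth answers.
lenient_accuracy = score_accuracy(["A", " b. ", "C", ""], ["A", "B", "D", "A"])
print(f"{lenient_accuracy:.1%}")
```

The normalizer is deliberately conservative: anything longer than a single letter after trimming punctuation is treated as unparseable rather than guessed at.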
5. Ablation Study
A simple accuracy score tells us how often the model is correct, but not why. Our full evaluation found that ~70% of correct answers on image understanding benchmarks don't actually require the image. To understand the true contribution of image understanding, we conduct an ablation study—running the same prompts without images.
This lets us classify each example into four quadrants:
| Quadrant | With Image | Without Image | Interpretation |
| Both Correct | ✓ | ✓ | Question may be solvable from text alone |
| Image Helped | ✓ | ✗ | True image understanding |
| Image Hurt | ✗ | ✓ | Visual confusion |
| Both Incorrect | ✗ | ✗ | Hard question or model limitation |
| SYSTEM_PROMPT_NO_IMAGE = "Respond to the multiple choice question with just the letter corresponding to the correct answer."

start = time.time()
df_ablation = df_eval.with_column(
    "result_no_image",
    prompt(
        messages=col("user"),
        system_message=SYSTEM_PROMPT_NO_IMAGE,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
        **PARAMS,
    )
).with_column(
    "is_correct_no_image",
    col("result_no_image")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
).collect()
elapsed = time.time() - start
print(f"Processed {df_ablation.count_rows()} rows in {elapsed:.1f} seconds")

accuracy_no_image = df_ablation.where(col("is_correct_no_image")).count_rows() / df_ablation.count_rows()

print(f"Accuracy with image: {accuracy:.1%}")
print(f"Accuracy without image: {accuracy_no_image:.1%}")
print(f"Delta: {accuracy - accuracy_no_image:+.1%}")
|
| from daft.functions import when, monotonically_increasing_id

# Add a row id (used for counting below) and label each row with its quadrant
df_classified = df_ablation.with_column(
    "id", monotonically_increasing_id()
).with_column(
    "quadrant",
    when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
    .when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
    .when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
    .otherwise("Both Incorrect")
)

df_classified.groupby("quadrant").count().select("quadrant", col("id").alias("count")).show()
|
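The quadrant logic is easy to check in isolation. Below is a plain-Python sketch of the same four-way classification applied to a handful of made-up outcomes; the function name quadrant and the sample data are illustrative only:

```python
from collections import Counter


def quadrant(correct_with_image, correct_without_image):
    """Map the two correctness flags to one of the four diagnostic quadrants."""
    if correct_with_image and correct_without_image:
        return "Both Correct"
    if correct_with_image:
        return "Image Helped"
    if correct_without_image:
        return "Image Hurt"
    return "Both Incorrect"


# Hypothetical (with_image, without_image) outcomes for five questions.
outcomes = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = Counter(quadrant(with_img, without_img) for with_img, without_img in outcomes)
print(dict(counts))
```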
Inspect cases where the image helped:
| df_classified.where(col("quadrant") == "Image Helped").select(
    "user", "image", "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image")
).show(3)
|
Inspect cases where the image hurt:
| df_classified.where(col("quadrant") == "Image Hurt").select(
    "user", "image", "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image")
).show(3)
|
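Finally, summarize the share of examples in each quadrant:
| total_count = df_classified.count_rows()
# Use a new name so the earlier df_results (raw inference output) is not overwritten
df_quadrants = df_classified.groupby("quadrant").count().select(
    "quadrant",
    col("id").alias("count")
).with_column(
    "percentage",
    (col("count") / daft.lit(total_count) * 100)
).collect()
df_quadrants.show()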
|
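The quadrant counts also give one plausible way to compute the share of correct answers that did not need the image: Both Correct divided by all answers that were correct with the image. This formula is an assumption about how such a statistic could be derived, not the author's stated definition, and the counts below are invented for illustration:

```python
def text_solvable_share(both_correct, image_helped):
    """Share of with-image correct answers that were also correct without the image."""
    correct_with_image = both_correct + image_helped
    if correct_with_image == 0:
        return 0.0
    return both_correct / correct_with_image


# Hypothetical quadrant counts for a 50-row sample.
hypothetical_counts = {
    "Both Correct": 28,
    "Image Helped": 12,
    "Image Hurt": 3,
    "Both Incorrect": 7,
}
share = text_solvable_share(
    hypothetical_counts["Both Correct"], hypothetical_counts["Image Helped"]
)
print(f"{share:.0%} of correct answers did not need the image")
```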
6. VLM-as-a-Judge
We can go beyond pass/fail metrics by using VLM-as-a-Judge to explain why the model failed, especially on the most informative failure subsets:
- Image Hurt: correct without the image, incorrect with the image
- Both Incorrect: incorrect with and without the image
We'll use a structured output schema so the judge reliably returns fields we can analyze.
| from daft.functions import format

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing the results of a textbook academic questions multiple choice benchmark.
Inspect the attached image and provide high-signal feedback on why the model chose its answer.
First, reason about the model's answer with the image and the model's answer without the image.
Second, develop a hypothesis for why the model made the choice it did.
Third, attribute the failure to a 'question' issue or an 'image' understanding issue.
Finally, assign whether the model's answer with the image is correct and whether the model's answer without the image is correct.
"""

class JudgeResponse(BaseModel):
    """Structured diagnostic feedback from the VLM judge."""
    reasoning: str = Field(..., description="Why did the model choose the answer it did?")
    hypothesis: str = Field(..., description="What caused the divergence from the correct answer?")
    attribution: str = Field(
        ...,
        description="Was this a 'question' issue or an 'image' understanding issue or 'other'?",
    )
|
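Because attribution is declared as a free-form string, the judge may return variations such as "Image understanding issue" or "'question'". A minimal sketch of collapsing those into the three labels named in the field description; normalize_attribution is a hypothetical post-processing helper, not something the pipeline above includes:

```python
ALLOWED_ATTRIBUTIONS = ("question", "image", "other")


def normalize_attribution(raw):
    """Collapse a free-text attribution into 'question', 'image', or 'other'."""
    text = raw.strip().strip("'\"").lower()
    for label in ("question", "image"):
        if text.startswith(label):
            return label
    return "other"


# Hypothetical judge outputs.
samples = ["image", " 'Image' understanding issue", "Question issue", "ambiguous"]
normalized = [normalize_attribution(s) for s in samples]
print(normalized)
```

Another option would be to declare the field with typing.Literal in the Pydantic model, though whether the inference provider enforces that constraint is not something the tutorial confirms.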
| # Build the per-row judge prompt from the question, both predictions, and the correct answer
judge_template = format(
    """Given the image attached and the multiple choice question of <question>{}</question>,
The model chose the following prediction <model_answer>{}</model_answer> and without the image, the model chose the following prediction <no_image_model_answer>{}</no_image_model_answer>, but the correct answer is <correct_answer>{}</correct_answer>.
Provide diagnostic feedback.
""",
    col("user"),
    col("result")["choice"],
    col("result_no_image")["choice"],
    col("answer"),
)

# Judge only the failure quadrants
df_failures = df_classified.where(
    (col("quadrant") == "Image Hurt") | (col("quadrant") == "Both Incorrect")
)

JUDGE_PARAMS = {"temperature": 0.0, "max_tokens": 512}

df_judged = df_failures.with_column(
    "judge_response",
    prompt(
        messages=[col("image"), judge_template],
        system_message=JUDGE_SYSTEM_PROMPT,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=JudgeResponse,
        **JUDGE_PARAMS,
    ),
).collect()

print(f"Judged {df_judged.count_rows()} failure rows")
|
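To see what the judge receives for a single row, the template can be rendered with plain str.format. This sketch assumes Daft's format fills the {} placeholders positionally, as the call above suggests; render_judge_prompt and the sample values are hypothetical:

```python
JUDGE_TEMPLATE = (
    "Given the image attached and the multiple choice question of "
    "<question>{}</question>,\n"
    "The model chose the following prediction <model_answer>{}</model_answer> "
    "and without the image, the model chose the following prediction "
    "<no_image_model_answer>{}</no_image_model_answer>, "
    "but the correct answer is <correct_answer>{}</correct_answer>.\n"
    "Provide diagnostic feedback.\n"
)


def render_judge_prompt(question, with_image, without_image, correct):
    """Fill the four placeholders in order: question, both predictions, answer."""
    return JUDGE_TEMPLATE.format(question, with_image, without_image, correct)


# Hypothetical "Image Hurt" row: wrong with the image, right without it.
rendered = render_judge_prompt("Which label marks the root?", "A", "B", "B")
print(rendered)
```

Printing a rendered prompt for one or two failure rows is a quick way to confirm the tags carry the values you expect before spending tokens on the full judge pass.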
The judge's attribution field helps separate question issues (ambiguous prompts) from image understanding issues (missed labels, visual ambiguity).
| df_judged.select(
    "quadrant",
    "user",
    "image",
    "answer",
    col("result")["choice"].alias("with_image"),
    col("result_no_image")["choice"].alias("without_image"),
    unnest(col("judge_response")),
).show(3)
|
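Once the judge has run, a cross-tabulation of quadrant against attribution shows where each kind of failure concentrates. A plain-Python sketch over invented rows; in practice the rows would come from the judged results, and tally_attributions is a hypothetical helper:

```python
from collections import Counter

# Hypothetical judged rows; the values are made up for illustration.
judged_rows = [
    {"quadrant": "Image Hurt", "attribution": "image"},
    {"quadrant": "Both Incorrect", "attribution": "question"},
    {"quadrant": "Both Incorrect", "attribution": "image"},
    {"quadrant": "Image Hurt", "attribution": "image"},
]


def tally_attributions(rows):
    """Count rows per (quadrant, attribution) pair."""
    return Counter((row["quadrant"], row["attribution"]) for row in rows)


tally = tally_attributions(judged_rows)
for (quadrant_name, attribution), count in sorted(tally.items()):
    print(f"{quadrant_name:15s} {attribution:10s} {count}")
```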
Verify the full pipeline ran:
| print(f"Accuracy (with image): {accuracy:.1%}")
print(f"Accuracy (without image): {accuracy_no_image:.1%}")
print(f"Delta: {accuracy - accuracy_no_image:+.1%}")
df_classified.groupby("quadrant").count().show()
print(f"Judge rows: {df_judged.count_rows()}")
|
7. Scale with Daft Cloud
This tutorial runs locally on 50 rows. The Cauldron contains millions of rows across 50 subsets. To run this evaluation at scale, use Daft Cloud.
The production-ready script eval_image_understanding.py includes:
- Multi-dataset evaluation across all Cauldron subsets
- Configurable batch processing
- Result aggregation and export
👉 Sign up for early access | Book a demo
8. Conclusion
In this tutorial, we built a small pipeline to evaluate Qwen3-VL's image understanding:
- Structured Outputs: Used Pydantic models to enforce consistent responses
- Ablation Study: Isolated image understanding from general reasoning
- Quadrant Analysis: Classified results into actionable categories
- VLM-as-a-Judge: Diagnosed failures on the most informative subsets ("Image Hurt" + "Both Incorrect")
Next Steps
- Multi-Dataset Evaluation: Try the full pipeline in the daft-examples repository, which supports evaluating across all 50 Cauldron subsets.
- Experiment Tracking: Wire judge feedback into MLflow or W&B to track improvements over time.
- RLVR Training: Use the is_correct signal and judge attributions for reinforcement learning with verifiable rewards.
Resources
Canonical References:
- Getting Structured LLM Output (DeepLearning.ai)
- Judging LLM-as-a-Judge (NeurIPS 2023)