Working with Embeddings
Embeddings transform text, images, and other data into dense vector representations that capture semantic meaning—enabling similarity search, retrieval-augmented generation (RAG), and AI-powered discovery. Daft makes it easy to generate, store, and query embeddings at scale.
With the native daft.DataType.embedding type and embed_text function, you can:
- Generate embeddings from any text column using providers like OpenAI, Cohere, or local models
- Compute similarity with built-in distance functions like
cosine_distance - Build search pipelines that scale from local development to distributed clusters
- Write to vector databases like Turbopuffer, Pinecone, or LanceDB
Semantic Search Example
The following example creates a simple semantic search pipeline—embedding documents, comparing them to a query, and ranking by similarity:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49 | import daft
from daft.functions import embed_text, cosine_distance
# Create a knowledge base with documents
documents = daft.from_pydict(
{
"text": [
"Python is a high-level programming language",
"Machine learning models require training data",
"Daft is a distributed dataframe library",
"Embeddings capture semantic meaning of text",
],
}
)
# Embed all documents
documents = documents.with_column(
"embedding",
embed_text(
daft.col("text"),
provider="openai",
model="text-embedding-3-small",
),
)
# Create a query
query = daft.from_pydict({"query_text": ["What is Daft?"]})
# Embed the query
query = query.with_column(
"query_embedding",
embed_text(
daft.col("query_text"),
provider="openai",
model="text-embedding-3-small",
),
)
# Cross join to compare query against all documents
results = query.join(documents, how="cross")
# Calculate cosine distance (lower is more similar)
results = results.with_column(
"distance", cosine_distance(daft.col("query_embedding"), daft.col("embedding"))
)
# Sort by distance and show top results
results = results.sort("distance").select("query_text", "text", "distance", "embedding")
results.show()
|
| Output |
|---|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 | ╭───────────────┬────────────────────────────────┬────────────────────┬──────────────────────────╮
│ query_text ┆ text ┆ distance ┆ embedding │
│ --- ┆ --- ┆ --- ┆ --- │
│ String ┆ String ┆ Float64 ┆ Embedding[Float32; 1536] │
╞═══════════════╪════════════════════════════════╪════════════════════╪══════════════════════════╡
│ What is Daft? ┆ Daft is a distributed datafra… ┆ 0.3621492191359764 ┆ ▄▇▆▅▄▄█▆▄▄▃▂▄▃▃▃▁▄▃▃▄▄▃▂ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Python is a high-level progra… ┆ 0.9163975397319742 ┆ ▇▆▅▇▅▆█▇▃▄▆▄▄▁▅▄▅▃▁▃▃▂▅▃ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Embeddings capture semantic m… ┆ 0.9374004015203741 ┆ ▄█▅▄▅▅▅▇▄▃▂▁▃▄▄▁▃▃▂▂▂▂▁▃ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ What is Daft? ┆ Machine learning models requi… ┆ 0.9696998373223874 ┆ ▇▇▆▃▄▆▅█▆▂▄▃▄▄▂▄▂▁▂▂▁▃▂▁ │
╰───────────────┴────────────────────────────────┴────────────────────┴──────────────────────────╯
(Showing first 4 of 4 rows)
|
Building a Document Search Pipeline
For production use cases, you'll typically combine embeddings with LLM-powered metadata extraction and write the results to a vector database.
This example shows an end-to-end pipeline that:
- Loads PDF documents from cloud storage
- Extracts structured metadata using an LLM
- Generates vector embeddings from the abstracts
- Writes everything to Turbopuffer for semantic search
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60 | # /// script
# description = "This example shows how using LLMs and embedding models, Daft chunks documents, extracts metadata, generates vectors, and writes them to any vector database..."
# dependencies = ["daft[openai, turbopuffer]", "pymupdf"]
# ///
import os
import daft
from daft import col, lit
from daft.functions import embed_text, prompt, file, unnest, monotonically_increasing_id
from pydantic import BaseModel
class Classifier(BaseModel):
title: str
author: str
year: int
keywords: list[str]
abstract: str
daft.set_execution_config(enable_dynamic_batching=True)
daft.set_provider("openai", api_key=os.environ.get("OPENAI_API_KEY"))
# Load documents and generate vector embeddings
df = (
daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/papers/*.pdf").limit(10)
.with_column(
"metadata",
prompt(
messages=file(col("path")),
system_message="Read the paper and extract the classifier metadata.",
return_format=Classifier,
model="gpt-5-mini",
)
)
.with_column(
"abstract_embedding",
embed_text(
daft.col("metadata")["abstract"],
model="text-embedding-3-large"
)
)
.with_column("id", monotonically_increasing_id())
.select("id", "path", unnest(col("metadata")), "abstract_embedding")
)
# Write to Turbopuffer
df.write_turbopuffer(
namespace="ai_papers",
api_key=os.environ.get("TURBOPUFFER_API_KEY"),
distance_metric="cosine_distance",
region='us-west-2',
schema={
"id": "int64",
"path": "string",
"title": "string",
"author": "string",
"year": "int",
"keywords": "list[string]",
"abstract": "string",
"abstract_embedding": "vector",
}
)
|