Generate Text Embeddings for Turbopuffer with Daft#
In this example, we demonstrate how to build a text embedding pipeline for processing large text datasets and storing these embeddings in Turbopuffer. You'll learn how to:
- Process millions of text documents in parallel with distributed computing
- Apply text chunking best practices for optimal embedding quality
- Generate high-quality embeddings using state-of-the-art models like Qwen3
- Perform distributed writes to vector databases like Turbopuffer
- Scale across multiple GPUs with tips for maximizing GPU utilization
What is Turbopuffer? Turbopuffer is a vector database that allows you to store and search through high-dimensional embeddings efficiently. It's designed for production workloads and provides fast similarity search capabilities.
What are embeddings? An embedding is a representation of data (text, images, audio, etc.), often a vector of numerical values, that encodes semantic information. These embeddings can then be used for semantic search, deduplication, multilingual applications, and more.
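To make the idea concrete, here is a minimal sketch of how similarity between embeddings is typically measured. The three-dimensional vectors are made-up toy values, not output from a real model (real embedding models produce much longer vectors), and cosine_similarity is a hypothetical helper written for this illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, for illustration only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Semantically close texts should score higher than unrelated ones.
sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)
```

Semantic search over a vector database boils down to finding the stored vectors that score highest against a query vector under a metric like this one.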
By the end, you should be able to run a text embedding pipeline on a cluster and achieve near 100% GPU utilization for your workloads.

Pipeline Overview#
Our pipeline will:
- Read text data from Parquet files in S3
- Split text into sentences using spaCy
- Generate embeddings using a Qwen3 model
- Write results to Turbopuffer
Prerequisites#
Before starting, install the required dependencies and download the spaCy model for text chunking:
You will also need AWS access. Individual methods may vary, but once set up you can log in via:
Step 1: Import Dependencies and Configure Constants#
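A minimal sketch of the constants that later steps refer to. The values are inferred from figures quoted elsewhere in this guide (8 GPU workers, 64 chunking instances, embedding batches of 128) rather than copied from the original script. The Hugging Face identifier for the Qwen3 model and the TOTAL_CHUNKING_INSTANCES name are assumptions, and the library imports are omitted.

```python
# Sketch of the pipeline constants. Values are inferred from the text of this
# guide, not copied from the original script; adjust them for your cluster.

NUM_GPU_NODES = 8          # GPU workers in the example cluster
CHUNKING_PARALLELISM = 8   # chunking UDF instances per node (8 * 8 = 64 total)

# Assumed Hugging Face identifier for the Qwen3 embedding model.
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"

# Number of texts per forward pass inside the embedding UDF.
SENTENCE_TRANSFORMER_BATCH_SIZE = 128

# Hypothetical convenience name: total chunking instances across the cluster.
TOTAL_CHUNKING_INSTANCES = NUM_GPU_NODES * CHUNKING_PARALLELISM
```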
Step 2: Create Text Chunking UDF#
Understanding Text Chunking#
When creating embeddings, it's useful to split your text into meaningful chunks. Text is hierarchical and can be broken down at different levels: Document → Sections → Paragraphs → Sentences → Words → Characters. The chunking strategy to use depends on your use case.
Chunking Strategies#
- Sentence-level chunking works well for most use cases, especially when the document structure is unclear or inconsistent.
- Paragraph-level chunking is good for RAG (Retrieval-Augmented Generation) applications where maintaining context across sentences is important.
- Section-level chunking is useful for long documents that have clear structural divisions.
- Fixed-size chunks are simple to implement but may break semantic meaning at arbitrary boundaries.
When to Use Each Approach#
- Sentence splitting is the default choice when you're unsure about the document structure or when working with diverse content types.
- Paragraph splitting is preferred for RAG systems where maintaining context across multiple sentences matters for retrieval quality.
- Custom splitting is necessary for specialized content like tweets, text messages, or code that don't follow standard paragraph structures.
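The strategies above differ mainly in where they cut the text. Here is a minimal sketch using only the standard library. The helper names are hypothetical, and the regex-based sentence splitter is a deliberately naive stand-in for a proper sentence boundary detector.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Paragraph-level: cut on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Sentence-level (naive): cut after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_fixed(text: str, size: int) -> list[str]:
    """Fixed-size: cut every `size` characters, regardless of meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


doc = "Daft reads the data. Then it chunks it.\n\nEmbeddings come next."

paragraphs = split_paragraphs(doc)   # 2 chunks
sentences = split_sentences(doc)     # 3 chunks
fixed = split_fixed(doc, 16)         # first chunk ends in the middle of "data"
```

The fixed-size split cuts the word "data" in half, which is the arbitrary-boundary problem described above. The sentence and paragraph splits keep each unit of meaning intact.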
Implementation#
We'll use sentence-level chunking in this example.
We'll also use spaCy, a natural language processing library whose sentence boundary detection handles edge cases better than simple punctuation-based splitting.
This User-Defined Function (UDF):
- Loads the spaCy model once per UDF during initialization for efficiency
- Processes batches of text (text_col) to minimize overhead
- Returns a list of sentence chunks with unique chunk IDs
- Runs multiple instances in parallel (NUM_GPU_NODES * CHUNKING_PARALLELISM = 64 total instances) for distributed processing
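The behaviour listed above can be sketched in plain Python, without Daft or spaCy. This illustrates the shape of the computation under stated assumptions and is not the UDF itself: SentenceChunker is a hypothetical class, a naive regex stands in for the spaCy model, and the Daft decorator that controls batching and parallelism is left out.

```python
import re


class SentenceChunker:
    """Plain-Python stand-in for a class-based chunking UDF."""

    def __init__(self) -> None:
        # Expensive setup happens once per instance. The real pipeline would
        # load a spaCy model here; a compiled regex is a naive substitute.
        self.boundary = re.compile(r"(?<=[.!?])\s+")

    def __call__(self, text_col: list[str]) -> list[list[dict]]:
        # One call handles a whole batch of documents.
        results = []
        for text in text_col:
            sentences = [s for s in self.boundary.split(text) if s]
            results.append(
                [{"text": s, "chunk_id": i} for i, s in enumerate(sentences)]
            )
        return results


chunker = SentenceChunker()
batch = ["First sentence. Second sentence!", "Only one here."]
chunks = chunker(batch)
```

In this sketch chunk_id restarts at 0 for every document, so it is unique only within a document. That is why the pipeline later combines it with the URL to form the final ID.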
Step 3: Create Embedding Generation UDF#
Choosing a Text Embedding Model#
The quality of your embeddings depends heavily on the model you choose. Here are some key considerations:
Model Performance
- MTEB Leaderboard: Check the Massive Text Embedding Benchmark (MTEB) leaderboard for the latest performance rankings across various tasks
- Task-specific performance: Different models excel at different tasks (semantic search, clustering, classification, etc.)
- Multilingual support: Consider if you need to process text in multiple languages
Some Popular Models
- Qwen3-Embedding-0.6B: Good performance-to-size ratio, state-of-the-art, used in this example
- all-MiniLM-L6-v2: The default used in the Sentence Transformers documentation, often used in tutorials
- gemini-embedding-001: The top multilingual model on MTEB at the time of writing. Requires Gemini API access
With open models available on Hugging Face, you can easily swap models by changing the EMBEDDING_MODEL_NAME constant.
We'll create a UDF to generate embeddings from the chunked text:
This UDF:
- Loads the SentenceTransformer model on GPU if available
- Uses bfloat16 precision to reduce memory usage
- Processes text in batches (SENTENCE_TRANSFORMER_BATCH_SIZE = 128) for optimal GPU utilization
- Returns numpy arrays which are compatible with Daft
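The batching behaviour described above can be illustrated without a GPU. In this sketch, ToyEncoder is a hypothetical stand-in for the SentenceTransformer model: it derives a small deterministic vector from a hash of each text and records the mini-batch sizes it sees. Only the batching logic mirrors the pipeline; the vectors carry no semantic meaning, and TOY_DIM is an assumed toy dimension.

```python
import hashlib

SENTENCE_TRANSFORMER_BATCH_SIZE = 128  # value quoted in this guide
TOY_DIM = 4                            # assumed toy dimension, far below a real model's


class ToyEncoder:
    """Hypothetical stand-in for an embedding model loaded once per worker."""

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []  # sizes of the mini-batches seen so far

    def _embed_one(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map the first TOY_DIM bytes of the hash to floats in [0, 1].
        return [b / 255.0 for b in digest[:TOY_DIM]]

    def encode(self, texts: list[str], batch_size: int) -> list[list[float]]:
        vectors = []
        for start in range(0, len(texts), batch_size):
            mini_batch = texts[start:start + batch_size]
            self.batch_sizes.append(len(mini_batch))
            vectors.extend(self._embed_one(t) for t in mini_batch)
        return vectors


encoder = ToyEncoder()
texts = [f"sentence {i}" for i in range(300)]
embeddings = encoder.encode(texts, batch_size=SENTENCE_TRANSFORMER_BATCH_SIZE)
```

With 300 input texts and a batch size of 128, the encoder sees mini-batches of 128, 128, and 44. A larger batch size means fewer, bigger forward passes, which tends to improve GPU throughput at the cost of memory.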
Step 4: Configure Distributed Processing#
You can run this script locally, but if you're interested in running this pipeline on a cluster, check out our guide on scaling up. In this example, we ran on a Ray cluster with 8 g5.2xlarge workers (each with one A10G GPU). To configure our Daft script to use the Ray cluster, we added:
Step 5: Execute the Pipeline#
Now we'll execute the complete data processing pipeline:
Pipeline steps explained:
- Read data: Load Parquet files from S3 with a large chunk size for efficiency
- Chunk text: Apply sentence splitting UDF
- Explode: Flatten the list of sentences into separate rows
- Extract fields: Get text and chunk_id from the sentence structs
- Generate embeddings: Apply embedding UDF to text
- Create IDs: Generate unique IDs combining URL and chunk_id
- Select columns: Keep only the necessary columns
- Write to Turbopuffer: Store data and vectors in Turbopuffer using Daft's DataFrame.write_turbopuffer method
If all works out well, when you run this script on your cluster, you should notice that network I/O, CPU work, and GPU work are pipelined to run in parallel, and you should see high GPU utilization :)
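To make the explode, field-extraction, and ID steps concrete, here is what they do to a single row, written in plain Python rather than Daft expressions. The sample row and the explode_row helper are hypothetical, and the "url-chunk_id" ID format is an assumption; this guide only states that IDs combine the URL and chunk_id.

```python
def explode_row(row: dict) -> list[dict]:
    """Turn one document row into one output row per sentence chunk."""
    out = []
    for sentence in row["sentences"]:  # explode: one row per list element
        out.append({
            "id": f'{row["url"]}-{sentence["chunk_id"]}',  # assumed ID format
            "url": row["url"],
            "text": sentence["text"],  # field extracted from the struct
        })
    return out


row = {
    "url": "https://example.com/a",
    "sentences": [
        {"text": "First sentence.", "chunk_id": 0},
        {"text": "Second sentence.", "chunk_id": 1},
    ],
}

flat_rows = explode_row(row)
```

Each resulting row carries exactly one sentence, so the embedding step that follows produces one vector per sentence.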
Customization Tips#
- Adjust batch sizes: Increase SENTENCE_TRANSFORMER_BATCH_SIZE for better throughput, decrease for lower GPU memory usage
- Scale workers: Modify NUM_GPU_NODES and CHUNKING_PARALLELISM based on your cluster size and cores available per node
- Change models: Replace EMBEDDING_MODEL_NAME with other SentenceTransformer models
- Different chunking: Modify ChunkingUDF to use different text chunking strategies
- Alternative vector databases: Replace with other vector databases like Lance, Pinecone, or Chroma
Performance Considerations#
- GPU memory: Monitor GPU memory usage and adjust batch sizes accordingly. If your GPUs fail to allocate sufficient memory or you exceed the max sequence length of your embedding model, SENTENCE_TRANSFORMER_BATCH_SIZE may be too large
- Model loading: UDFs load models once per worker, so initialization time is amortized
- Quantization: Use bfloat16 or float16 quantization for lower GPU memory utilization and higher throughput.
This pipeline can efficiently process millions of text documents while automatically scaling across your available compute resources.
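As a rough illustration of the precision point above, the memory needed just to hold model weights scales with bytes per parameter. This back-of-the-envelope sketch assumes roughly 0.6 billion parameters (read off the model name Qwen3-Embedding-0.6B) and ignores activations, which grow with batch size and sequence length. weight_memory_gb is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 0.6e9  # assumption based on the "0.6B" in the model name

fp32_gb = weight_memory_gb(NUM_PARAMS, "float32")   # about 2.4 GB
bf16_gb = weight_memory_gb(NUM_PARAMS, "bfloat16")  # about 1.2 GB
```

Halving the bytes per parameter halves the weight footprint, which leaves more GPU memory for larger batches.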
Complete Script#
Here's the complete script you can run: