Working with Text
This how-to guide shows you how to accomplish common text processing tasks with Daft, such as generating text embeddings and chunking text into smaller pieces.
Generate text embeddings
Text embeddings convert text into numerical vectors that capture semantic meaning. Use them for semantic search, similarity calculations, and other NLP tasks.
How to use the embed_text function
By default, embed_text uses the Sentence Transformers provider, which requires the sentence-transformers optional dependency.
Once it is installed, apply embed_text to a text column to compute an embedding for each row.
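As a minimal sketch of what such a call can look like, the helper below adds an embedding column to an existing DataFrame. It assumes that embed_text is importable from daft.functions.ai and accepts a column expression; the helper name add_embeddings and the column names are illustrative, so check the import path against your installed Daft version.

```python
def add_embeddings(df, text_column="text", output_column="embedding"):
    """Return df with an extra column of text embeddings (hypothetical helper)."""
    # Imported lazily so the optional dependencies are only needed when this runs.
    import daft

    # Assumption: this import path matches your installed Daft version.
    from daft.functions.ai import embed_text

    # No provider or model is passed, so embed_text uses its default provider.
    return df.with_column(output_column, embed_text(daft.col(text_column)))
```

For example, add_embeddings(daft.from_pydict({"text": ["Hello, world!"]})).show() would print the new column. The first run typically needs network access to download the model.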
How to use different providers
Using Sentence Transformers
Sentence Transformers is a popular Python library for computing embeddings.
First install the optional Sentence Transformers dependency for Daft.
Then use the transformers provider with any open model hosted on Hugging Face, such as BAAI/bge-base-en-v1.5.
Using OpenAI
OpenAI is a popular choice for generating text embeddings.
First install the optional OpenAI dependency for Daft.
You will also need to set your OPENAI_API_KEY environment variable.
Then use the openai provider with any desired OpenAI embedding model such as text-embedding-3-small.
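The steps above could be sketched as follows. This is an illustration under stated assumptions, not the library's reference usage: it assumes embed_text is importable from daft.functions.ai and accepts provider and model as keyword arguments, with the provider given by name. The helper name embed_with_openai and the output column name are hypothetical.

```python
import os


def embed_with_openai(df, text_column="text", model="text-embedding-3-small"):
    """Add an embedding column computed by the openai provider (hypothetical helper)."""
    # Fail early with a clear message if the API key is missing.
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")

    # Imported lazily so the optional dependencies are only needed when this runs.
    import daft

    # Assumption: this import path matches your installed Daft version.
    from daft.functions.ai import embed_text

    # Assumption: provider and model are accepted as keyword arguments.
    return df.with_column(
        "embedding",
        embed_text(daft.col(text_column), provider="openai", model=model),
    )
```

Checking the environment variable up front keeps a missing key from surfacing later as a less obvious authentication error in the middle of a query.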
Model Constraints
Different embedding models have different constraints. For example, OpenAI's text-embedding-3-small model has a maximum context length of 8,192 tokens, so longer inputs can fail with an error reporting that the model's maximum context length was exceeded.
In this case, you could either use a different model with a larger maximum context length or chunk your text into smaller segments before generating embeddings. See our text embeddings guide for examples of text chunking strategies, or refer to the section below on text chunking.
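One simple way to stay under a length limit is to split text into overlapping windows. The sketch below counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: real tokenizers often produce more than one token per word, so choose max_words conservatively. The function name chunk_words and its defaults are illustrative.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words that share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a boundary still appears whole in at least one of them when it is shorter than the overlap.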
Using LM Studio
LM Studio is a local AI model platform that lets you run large language models like Qwen, Mistral, Gemma, or gpt-oss on your own machine. If you're running an LM Studio server, Daft can use it as a provider for computing embeddings.
First install the optional OpenAI dependency for Daft. This is needed because LM Studio uses an OpenAI-compatible API.
LM Studio runs on localhost port 1234 by default, but you can customize the base_url as needed in Daft. In this example, we use the nomic-ai/nomic-embed-text-v1.5 embedding model.
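Because the server speaks an OpenAI-compatible API, it can be useful to confirm that it answers embedding requests before pointing Daft at it. The sketch below does not use Daft at all: it builds a request for the OpenAI-style embeddings endpoint using only the standard library. The /v1 path prefix and the function names are assumptions, and the model identifier your server reports may differ from the one shown.

```python
import json
import urllib.request

# Default port from the text above; the /v1 prefix is an assumption.
LM_STUDIO_BASE_URL = "http://localhost:1234/v1"


def build_embedding_request(texts, model="nomic-ai/nomic-embed-text-v1.5",
                            base_url=LM_STUDIO_BASE_URL):
    """Build (but do not send) a POST request for the embeddings endpoint."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embeddings(texts, **kwargs):
    """Send the request; needs a running server with an embedding model loaded."""
    with urllib.request.urlopen(build_embedding_request(texts, **kwargs)) as response:
        body = json.load(response)
    # OpenAI-compatible responses return one item per input under "data".
    return [item["embedding"] for item in body["data"]]
```

If fetch_embeddings(["hello"]) returns a list containing one vector, the same base URL should be a reasonable value to pass as base_url in Daft.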
How to work with embeddings
It's common to use embeddings for various tasks like similarity search or retrieval with a vector database.
Check out our guide on writing to turbopuffer to work with a popular fast vector database.
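To make the idea of similarity search concrete, here is a small in-memory sketch that ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors are toy values chosen for illustration, not real model output, and the function names are hypothetical. A vector database performs the same kind of ranking at much larger scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the k (label, score) pairs from corpus most similar to query, best first."""
    scored = [(label, cosine_similarity(query, vector)) for label, vector in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for embeddings of three documents.
corpus = {
    "cats": [1.0, 0.0, 0.0],
    "kittens": [0.9, 0.1, 0.0],
    "tax law": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

Here results lists "cats" first and "kittens" second, because their vectors point in nearly the same direction as the query while the "tax law" vector is orthogonal to it.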
Chunk text into smaller pieces
When working with large text documents, you often need to break them into smaller chunks.
How to chunk by sentences
A popular library for sentence chunking is spaCy.
First, install spaCy (pip install spacy) and a spaCy model such as en_core_web_sm (python -m spacy download en_core_web_sm).
Then, create a user-defined function (UDF) that uses spaCy to split each document into sentences.
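A sketch of such a UDF is shown below. It assumes Daft's class-based daft.udf decorator and a return type of list of strings. The names split_sentences, make_sentence_chunker, and SentenceChunker are hypothetical, and the factory function exists only so that daft and spaCy are imported when the UDF is actually built.

```python
def split_sentences(nlp, text):
    """Return the sentences spaCy finds in text (empty list for missing text)."""
    if not text:
        return []
    return [sent.text.strip() for sent in nlp(text).sents]


def make_sentence_chunker(model_name="en_core_web_sm"):
    """Build a Daft UDF that maps each string to a list of sentences."""
    # Imported lazily; requires daft and spaCy to be installed.
    import daft

    @daft.udf(return_dtype=daft.DataType.list(daft.DataType.string()))
    class SentenceChunker:
        def __init__(self):
            import spacy

            # Load the pipeline once per UDF instance rather than once per row.
            self.nlp = spacy.load(model_name)

        def __call__(self, text_col):
            # text_col is a daft.Series; return one list of sentences per input row.
            return [split_sentences(self.nlp, text) for text in text_col.to_pylist()]

    return SentenceChunker
```

A possible usage is chunker = make_sentence_chunker() followed by df.with_column("sentences", chunker(daft.col("text"))).explode("sentences"), which would yield one row per sentence.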
For a fuller discussion of text chunking strategies, check out the text chunking section in our tutorial on text embeddings.
More examples
Check out our end-to-end tutorial for a complete workflow: chunking text, generating embeddings, and uploading to vector databases like turbopuffer.