Transcription, Summaries, and Embeddings at Scale
This tutorial walks through building a Voice AI analytics pipeline with Daft and Faster-Whisper, from raw audio to searchable, multilingual transcripts. You'll learn how to:
- Transcribe long-form audio using Faster-Whisper with built-in VAD for speech segmentation
- Use Daft's dataframe engine to orchestrate and parallelize multimodal processing at scale
- Generate summaries, translations, and embeddings directly from transcripts
In short, you'll see how Daft simplifies multimodal AI pipelines, letting you process, enrich, and query audio data with the same ease as tabular data.
Introduction to Voice AI
Behind every AI meeting note, podcast summary, and voice agent lies an AI pipeline that transcribes raw audio and enriches those transcripts so they are easy to retrieve for downstream applications.
Voice AI encompasses a broad range of tasks:
- Voice Activity Detection (VAD) - Detects when speech is present in an audio signal
- Speech-to-Text (STT) - The core method of extracting transcriptions from audio
- Speaker Diarization - Identifies and segments which speaker is talking when
- LLM Text Generation - For summaries, translations, and more
- Text-to-Speech (TTS) - Brings LLM responses and translations to life in spoken form
- Turn Detection - Useful for live voice chat
In this tutorial we will focus on Speech-to-Text (STT) and LLM Text Generation, exploring common techniques for preprocessing and enriching speech from audio to support downstream applications like meeting summaries, highlight extraction, and embeddings.
Challenges in Processing Audio for AI Pipelines
Audio is inherently different from traditional structured data. Since audio isn't stored in neat rows and columns in a table, running frontier models on audio data comes with some extra challenges.
Before we can run our STT models on audio data we'll need to:
- Read and preprocess raw audio files into a form that the model can process
- Handle memory constraints (e.g., one hour of uncompressed 48 kHz/24-bit stereo audio is roughly a gigabyte)
- Decode, buffer, and resample audio files into chunks
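The memory figure above follows from simple arithmetic on uncompressed PCM audio. A minimal sketch (the helper name `pcm_size_bytes` is ours, not part of Daft or Faster-Whisper):

```python
def pcm_size_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> int:
    """Size of uncompressed PCM audio: samples/sec * bytes/sample * channels * duration."""
    bytes_per_sample = bit_depth // 8
    return int(sample_rate_hz * bytes_per_sample * channels * seconds)


# One hour of 48 kHz / 24-bit stereo audio
studio = pcm_size_bytes(48_000, 24, 2, 3600)
print(f"{studio / 1e9:.2f} GB")  # 1.04 GB

# The same hour resampled to 16 kHz mono float32 (32-bit samples)
resampled = pcm_size_bytes(16_000, 32, 1, 3600)
print(f"{resampled / 1e9:.2f} GB")  # 0.23 GB
```

Resampling to a model-friendly rate shrinks the footprint considerably, but a batch of long recordings can still exhaust memory if everything is decoded at once.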
Traditional approaches face challenges:
- Scaling parallelism requires multiprocessing/threading (error-prone, GIL limitations)
- Memory management needs custom generators/lazy loading (overflows common)
- Pipelining stages are hardcoded (modifications tedious, no retry mechanisms)
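For context, the hand-rolled approach described above often looks something like the following sketch: a generator for lazy chunking plus a thread pool for parallelism. Everything here is illustrative; `fake_transcribe` is a placeholder for a real model call, and the byte string stands in for audio data.

```python
import io
from concurrent.futures import ThreadPoolExecutor
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_bytes: int) -> Iterator[bytes]:
    """Lazily yield fixed-size chunks so the whole file is never read in one call."""
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        yield chunk


def fake_transcribe(chunk: bytes) -> int:
    """Stand-in for a model call; here it just reports the chunk length."""
    return len(chunk)


# 10 bytes of fake "audio", processed 4 bytes at a time on a small thread pool
stream = io.BytesIO(b"0123456789")
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fake_transcribe, iter_chunks(stream, 4)))

print(results)  # [4, 4, 2]
```

Even this small example hides a trap: `Executor.map` collects its input iterable immediately, so the generator's laziness is lost unless you add your own back-pressure. That is the kind of detail that makes manual pipelines error-prone.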
Daft solves these issues by:
- Providing a unified dataframe interface for multimodal data
- Handling distributed parallelism automatically
- Managing memory efficiently with Apache Arrow format
Setup and Imports
Let's start by importing the necessary libraries and setting up our environment.
First, install the required dependencies:
```bash
pip install daft faster-whisper soundfile sentence-transformers python-dotenv openai
```
Then import the necessary modules:
```python
from dataclasses import asdict
import os

import daft
from daft import DataType, col
from daft.functions import format, file, unnest
from daft.functions.ai import prompt, embed_text
from daft.ai.openai.provider import OpenAIProvider
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Load environment variables
from dotenv import load_dotenv

load_dotenv()
```
Define Constants and Configuration
Let's define the parameters we'll use throughout this tutorial.
```python
# Define Constants
SAMPLE_RATE = 16000
DTYPE = "float32"
BATCH_SIZE = 16

# Define Parameters
SOURCE_URI = "hf://datasets/Eventual-Inc/sample-files/audio/*.mp3"
DEST_URI = ".data/voice_ai_analytics"
LLM_MODEL_ID = "openai/gpt-oss-120b"
EMBEDDING_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
CONTEXT = "Daft: Unified Engine for Data Analytics, Engineering & ML/AI (github.com/Eventual-Inc/Daft) YouTube channel video. Transcriptions can have errors like 'DAF' referring to 'Daft'."
PRINT_SEGMENTS = True
```
Faster-Whisper comes with built-in VAD from Silero for segmenting long-form audio into neat chunks. Because Whisper itself only operates on 30-second windows, this means we don't need to worry about the length of the audio or handle any windowing ourselves. We also want to take full advantage of Faster-Whisper's BatchedInferencePipeline to improve throughput.
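To make the segmentation idea concrete, here is a toy sketch of what a VAD post-processing step does conceptually: merge speech regions separated by short silences, keep each chunk within a 30-second window, and pad the result. This is an illustration only, not Silero's or Faster-Whisper's actual implementation; the function name and the sample timestamps are made up, and the defaults simply mirror the `min_silence_duration_ms` and `speech_pad_ms` values used later in this tutorial.

```python
from typing import List, Tuple

Region = Tuple[float, float]  # (start_s, end_s) of detected speech


def merge_regions(
    regions: List[Region],
    min_silence_s: float = 0.5,
    pad_s: float = 0.2,
    max_chunk_s: float = 30.0,
) -> List[Region]:
    """Merge regions whose gap is shorter than min_silence_s, unless the
    merged chunk would exceed max_chunk_s, then pad each chunk on both sides."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            if gap < min_silence_s and end - prev_start <= max_chunk_s:
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    # Padding is applied after merging, clamped so it never goes below zero
    return [(max(0.0, s - pad_s), e + pad_s) for s, e in merged]


# Three detected speech regions; the first two are only 0.2 s apart
speech = [(0.3, 2.0), (2.2, 5.0), (7.0, 9.5)]
chunks = merge_regions(speech)
print(chunks)  # two chunks: the first two regions merge, the third stays separate
```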
Creating the FasterWhisperTranscriber Class
We'll define a FasterWhisperTranscriber class and decorate it with @daft.cls(). This converts any standard Python class into a distributed, massively parallel user-defined function (UDF), enabling us to take full advantage of Daft's Rust-backed performance.
Key design decisions:
- We load the model in the __init__ method, separately from inference
  - Models can easily reach multiple GB in size, so we initialize once during class instantiation to avoid repeated downloads
- We input a daft.File and return a dictionary that will be materialized as a daft.DataType.struct()
  - Faster-whisper supports reading files directly, so we use daft.File for simplified preprocessing
Note: Jump to the bottom of this document to see how TranscriptionResult is defined. It must be defined before the class below, because the @daft.method decorator references it.
```python
@daft.cls()
class FasterWhisperTranscriber:
    def __init__(self, model="distil-large-v3", compute_type="float32", device="auto"):
        self.model = WhisperModel(model, compute_type=compute_type, device=device)
        self.pipe = BatchedInferencePipeline(self.model)

    @daft.method(return_dtype=TranscriptionResult)
    def transcribe(self, audio_file: daft.File):
        """Transcribe Audio Files with Voice Activity Detection (VAD) using Faster Whisper"""
        with audio_file.to_tempfile() as tmp:
            segments_iter, info = self.pipe.transcribe(
                str(tmp.name),
                vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500, speech_pad_ms=200),
                word_timestamps=True,
                without_timestamps=False,
                temperature=0,
                batch_size=BATCH_SIZE,
            )
            segments = [asdict(seg) for seg in segments_iter]
            text = " ".join([seg["text"] for seg in segments])

        return {"transcript": text, "segments": segments, "info": asdict(info)}
```
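The `asdict` calls in the transcriber are what turn result objects into plain dictionaries that can map onto a struct type. A small stand-alone sketch shows that nested dataclasses become nested dicts and lists. `Word` and `Segment` here are simplified placeholders we define ourselves, not Faster-Whisper's real classes.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Word:
    start: float
    end: float
    word: str


@dataclass
class Segment:
    id: int
    text: str
    words: List[Word]


seg = Segment(id=1, text=" Hi there", words=[Word(0.0, 0.4, " Hi"), Word(0.4, 0.9, " there")])

# asdict recurses into nested dataclasses, including those inside lists
row = asdict(seg)
print(row["words"][0])  # {'start': 0.0, 'end': 0.4, 'word': ' Hi'}
```

The resulting dictionary has the same shape as a struct with a nested list-of-structs field, which is why the return type in the appendix mirrors the result objects field by field.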
Setting Up OpenAI Provider for LLM Operations
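We'll use OpenRouter as our LLM provider for summaries and translations. OpenRouter exposes an OpenAI-compatible API, so we configure Daft's OpenAIProvider with OpenRouter's base URL and API key: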
```python
# Create an OpenAI provider, attach, and set as the default
openrouter_provider = OpenAIProvider(
    name="OpenRouter",
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)
daft.attach_provider(openrouter_provider)
daft.set_provider("OpenRouter")
```
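Note that `os.environ.get` returns `None` when the variable is missing, so a forgotten key would likely only surface later, at the first LLM call. One optional safeguard is to fail fast. The `require_env` helper below is a hypothetical convenience of ours, not part of Daft, and the demo variable exists only so the snippet runs anywhere.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return value


# Demo with a throwaway variable so this snippet runs anywhere
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))  # sk-demo
```

In the provider setup, `require_env("OPENROUTER_API_KEY")` could stand in for the `os.environ.get(...)` call.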
Understanding Daft's DataFrame Interface
Before we dive into transcription, let's understand why Daft's dataframe interface is powerful:
- Tabular Operations: Traditional operations run within a managed data model, which makes it harder to corrupt data structures
- Automatic Parallelism: Daft abstracts away the complexity of orchestrating distributed, parallel processing, aiming for high CPU and GPU utilization by default
- Lazy Evaluation: Operations aren't materialized until we invoke collection, which enables query optimization and decouples the transformations from loading the data
Daft's execution engine uses a push-based processing model. Because the whole plan is known up front, from reading the source through the transformation logic to the final write to disk, the engine can optimize each operation before anything runs.
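Lazy evaluation can be illustrated without Daft at all. The toy `LazyFrame` class below (a hypothetical stand-in, far simpler than Daft's planner and with no optimization) only records transformations and runs them when `collect()` is called:

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class LazyFrame:
    """Toy lazy dataframe: records steps, runs them on collect()."""

    def __init__(self, rows: List[Row], steps: Optional[List[Callable[[Row], Row]]] = None):
        self._rows = rows
        self._steps = steps or []

    def with_column(self, name: str, fn: Callable[[Row], object]) -> "LazyFrame":
        # Nothing is computed here; we only remember what to do later.
        def step(row: Row) -> Row:
            return {**row, name: fn(row)}

        return LazyFrame(self._rows, self._steps + [step])

    def collect(self) -> List[Row]:
        out = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            out.append(row)
        return out


calls = []


def times_ten(row: Row) -> int:
    calls.append(row["x"])  # record that real work happened
    return row["x"] * 10


lf = LazyFrame([{"x": 1}, {"x": 2}]).with_column("y", times_ten)
before = list(calls)   # still empty: with_column did no work
result = lf.collect()  # the recorded step runs now

print(before)  # []
print(result)  # [{'x': 1, 'y': 10}, {'x': 2, 'y': 20}]
```

Because the full list of steps exists before any row is processed, a real engine has the opportunity to reorder, fuse, or parallelize them.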
Step 1: Transcription
Now let's transcribe our audio files:
- Discover audio files from the source URI
- Wrap paths as daft.File objects
- Transcribe using our FasterWhisperTranscriber
- Unpack the results into separate columns
```python
# Instantiate Transcription UDF
fwt = FasterWhisperTranscriber()

print(
    "\n\nRunning Transcription with Voice Activity Detection (VAD) using Faster Whisper..."
)

# Transcribe the audio files
df_transcript = (
    # Discover the audio files
    daft.from_glob_path(SOURCE_URI)
    # Wrap the path as a daft.File
    .with_column("audio_file", file(col("path")))
    # Transcribe the audio file with Voice Activity Detection (VAD) using Faster Whisper
    .with_column("result", fwt.transcribe(col("audio_file")))
    # Unpack Results
    .select("path", "audio_file", unnest(col("result")))
).collect()  # collect() runs the pipeline and materializes the results
```
```python
# Show the transcript
df_transcript.select(
    "path",
    "info",
    "transcript",
    "segments",
).show(3, format="fancy", max_width=40)
```
```
╭────────────────────────────────────────┬─────────────────────────┬────────────────────────────────────────┬──────────────╮
│ path                                   ┆ info                    ┆ transcript                             ┆ segments     │
╞════════════════════════════════════════╪═════════════════════════╪════════════════════════════════════════╪══════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ {language: en,          ┆ Hi, I'm Kevin. Let's talk batch infe…  ┆ [{id: 1,     │
│                                        ┆ language_probability: … ┆                                        ┆ seek: 0,     │
│                                        ┆                         ┆                                        ┆ start: 0.09, │
│                                        ┆                         ┆                                        ┆ end: 2…      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ {language: en,          ┆ I'm climbing today. Peor. What are…    ┆ [{id: 1,     │
│                                        ┆ language_probability: … ┆                                        ┆ seek: 0,     │
│                                        ┆                         ┆                                        ┆ start: 0.76, │
│                                        ┆                         ┆                                        ┆ end: 1…      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ {language: en,          ┆ Hi, I'm Colin. I'm a software engine…  ┆ [{id: 1,     │
│                                        ┆ language_probability: … ┆                                        ┆ seek: 0,     │
│                                        ┆                         ┆                                        ┆ start: 0.15, │
│                                        ┆                         ┆                                        ┆ end: 2…      │
╰────────────────────────────────────────┴─────────────────────────┴────────────────────────────────────────┴──────────────╯
(Showing first 7 of 7 rows)
```
Great! We've successfully transcribed our audio files. The dataframe now contains:
- path: The source file path
- transcript: The full transcription text
- segments: A list of transcription segments with timestamps
- info: Metadata about the transcription (language, duration, etc.)
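Because each segment carries `start` and `end` timestamps, simple analytics fall out directly. The sketch below computes how much of a recording is speech; the `speech_stats` helper and the sample segments are made up for illustration and use only the fields needed here.

```python
from typing import Dict, List


def speech_stats(segments: List[Dict], audio_duration_s: float) -> Dict[str, float]:
    """Summarize how much of a recording is speech, based on segment timestamps."""
    spoken = sum(seg["end"] - seg["start"] for seg in segments)
    return {
        "num_segments": len(segments),
        "spoken_s": round(spoken, 2),
        "speech_ratio": round(spoken / audio_duration_s, 2) if audio_duration_s else 0.0,
    }


# Made-up segments shaped like the rows above (only the fields used here)
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hi, I'm Kevin."},
    {"start": 3.0, "end": 7.5, "text": " Let's talk batch inference."},
]
stats = speech_stats(segments, audio_duration_s=10.0)
print(stats)  # {'num_segments': 2, 'spoken_s': 7.0, 'speech_ratio': 0.7}
```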
Step 2: Summarization
Moving on to our downstream enrichment stages, summarization is a common and simple means of leveraging an LLM for publishing, socials, or search. With Daft, generating a summary from your transcripts is as simple as adding a column.
We'll also demonstrate how easy it is to add translations - since all the data is organized and accessible, we just need to declare what we want!
```python
# Summarize the transcripts and translate to Chinese
df_summaries = (
    df_transcript
    # Summarize the transcripts
    .with_column(
        "summary",
        prompt(
            format(
                "Summarize the following transcript from a YouTube video belonging to {}: \n {}",
                daft.lit(CONTEXT),
                col("transcript"),
            ),
            model=LLM_MODEL_ID,
        ),
    ).with_column(
        "summary_chinese",
        prompt(
            format(
                "Translate the following text to Simplified Chinese: <text>{}</text>", col("summary")
            ),
            system_message="You will be provided with a piece of text. Your task is to translate the text to Simplified Chinese exactly as it is written. Return the translated text only, no other text or formatting.",
            model=LLM_MODEL_ID,
        ),
    )
)

print("\n\nGenerating Summaries...")
```
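To preview what the model receives for each row, the summary template can be rendered in plain Python. This sketch assumes Daft's `format` fills the `{}` placeholders positionally, the way `str.format` does; the shortened `CONTEXT` and the sample transcripts are stand-ins.

```python
CONTEXT = "Daft YouTube channel video"  # shortened stand-in for the CONTEXT constant
TEMPLATE = "Summarize the following transcript from a YouTube video belonging to {}: \n {}"

transcripts = ["Hi, I'm Kevin. Let's talk batch inference.", "Hi, I'm Colin."]

# One rendered prompt per row, mirroring what the column expression describes
prompts = [TEMPLATE.format(CONTEXT, t) for t in transcripts]
print(prompts[1])
```

Rendering a few prompts this way is a cheap sanity check before spending tokens on the whole dataset.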
```python
# Show the summaries and the transcript
df_summaries.select(
    "path",
    "transcript",
    "summary",
    "summary_chinese",
).show(format="fancy", max_width=40)
```
```
╭────────────────────────────────────────┬────────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────────────────────────────╮
│ path                                   ┆ transcript                             ┆ summary                                ┆ summary_chinese                                             │
╞════════════════════════════════════════╪════════════════════════════════════════╪════════════════════════════════════════╪═════════════════════════════════════════════════════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin, engineer at Eventual,…  ┆ **Video Summary – “Spark Connect for … ┆ **视频摘要 – “Spark Connect for Daft”（Daf…                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Colin. I'm a software engine…  ┆ **Video Summary – “Unified Engine for… ┆ **视频摘要 – “统一的数据分析、工程与 ML/AI 引擎 (Daft)…     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Okay, so I have a cluster running wi…  ┆ **Video Summary – “Unified Engine for… ┆ **视频摘要 – “统一的用于数据分析、工程和 ML/AI 的引擎”（Da… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin. Let's talk batch infe…  ┆ **Video Summary – “Batch Inference wi… ┆ 3. **执行** – `daft.run()` 执行该操作。                     │
│                                        ┆                                        ┆                                        ┆                                                             │
│                                        ┆                                        ┆                                        ┆ - …                                                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Colin, a software engineer a…  ┆ **Video Summary – “Unified Engine for… ┆ **视频摘要 – “统一的数据分析、工程与 ML/AI 引擎”（Daft）…   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Real-old data is messy. There's an e…  ┆ **Summary of the Daft “Unified Engine… ┆ **Daft “统一的数据分析、工程与 ML/AI 引擎” 视频摘要**       │
│                                        ┆                                        ┆                                        ┆ …                                                           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ I'm climbing today. Peor. What are…    ┆ **Summary**                            ┆ **摘要**                                                    │
│                                        ┆                                        ┆                                        ┆                                                             │
│                                        ┆                                        ┆ The video opens with a b…              ┆ 视频以一次简短的…                                           │
╰────────────────────────────────────────┴────────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────────────────────────────╯
(Showing first 7 of 7 rows)
```
Excellent! We now have summaries in both English and Chinese. This demonstrates how easy it is to add multilingual support to your pipeline.
Step 3: Generating Subtitles
A common downstream task is preparing subtitles. Since our segments come with start and end timestamps, we can easily add another section to our Voice AI pipeline for translation. We'll explode the segments (one row per segment) and translate each segment to Simplified Chinese.
```python
# Explode the segments and translate to Simplified Chinese for subtitles
df_segments = (
    df_transcript.explode("segments")
    .select(
        "path",
        unnest(col("segments")),
    )
    .with_column(
        "segment_text_chinese",
        prompt(
            format("Translate the following text to Simplified Chinese: <text>{}</text>", col("text")),
            system_message="You will be provided with a transcript segment. Your task is to translate the text to Simplified Chinese exactly as it is written. Return the translated text only, no other text or formatting.",
            model=LLM_MODEL_ID,
        ),
    )
)

print("\n\nGenerating Chinese Subtitles...")
```
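If `explode` followed by `unnest` is unfamiliar, the following pure-Python sketch mimics the two reshaping steps on a list of dicts. It is a toy model of the semantics (for instance, it simply drops rows whose list is empty), not how Daft implements them, and the sample row is made up.

```python
from typing import Dict, List


def explode(rows: List[Dict], column: str) -> List[Dict]:
    """Emit one output row per element of a list column, repeating the other fields."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


def unnest(rows: List[Dict], column: str) -> List[Dict]:
    """Lift the keys of a dict column into top-level fields."""
    return [{**{k: v for k, v in row.items() if k != column}, **row[column]} for row in rows]


rows = [{"path": "a.mp3", "segments": [{"start": 0.0, "text": "Hi"}, {"start": 1.2, "text": "there"}]}]
flat = unnest(explode(rows, "segments"), "segments")
print(flat)
# [{'path': 'a.mp3', 'start': 0.0, 'text': 'Hi'}, {'path': 'a.mp3', 'start': 1.2, 'text': 'there'}]
```

After this reshaping every segment is its own row, which is what lets the translation prompt run once per segment.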
```python
# Show the segments and translations
df_segments.select(
    "path",
    col("text"),
    "segment_text_chinese",
).show(format="fancy", max_width=40)
```
```
╭────────────────────────────────────────┬────────────────────────────────────────┬────────────────────────────────────────────────────────────────╮
│ path                                   ┆ text                                   ┆ segment_text_chinese                                           │
╞════════════════════════════════════════╪════════════════════════════════════════╪════════════════════════════════════════════════════════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Then we're using DAF's LLM generate …  ┆ 然后我们在数据集的 prompts 列上使用 DAF 的 LLM 生成函数…       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So here we're using DAF to read a CS…  ┆ <text> 所以这里我们使用 DAF 从 Hugging Face 读取…              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ With DAF's LLM generate function, th…  ┆ 使用 DAF 的 LLM 生成函数，这非常容易实现。…                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So let me just run the code first wh…  ┆ <text> 所以让我先运行代码，同时解释一下发生了什么。 </text>…   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ specifying that we want to run it on…  ┆ <text>指定我们想要在 Open AI 提供的 GPT5 Nano 模…              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ awesome chat GPT prompts data set.     ┆ <text> 超棒的 chat GPT 提示数据集。</text>…                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin. Let's talk batch infe…  ┆ 嗨，我是凯文。让我们谈谈批量推理。…                            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Say you have a dataset of prompts th…  ┆ <text> 假设你有一个提示数据集，想要将其运行在 GPT 上。</…      │
╰────────────────────────────────────────┴────────────────────────────────────────┴────────────────────────────────────────────────────────────────╯
```
Perfect! These segments can now be used to make content accessible to wider audiences, which is a great way to increase reach. Each segment:
- Keeps its original text and timestamps (start, end)
- Has a Chinese translation
- Is ready to use for subtitle generation
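As one possible next step, the segment rows can be written out in the SubRip (SRT) subtitle format. The sketch below is a hedged example: `srt_timestamp` and `to_srt` are helper names of ours, the sample rows are made up, and the tag stripping is there because some translations in the output above echo the `<text>` wrapper from the prompt.

```python
import re
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict], text_key: str = "segment_text_chinese") -> str:
    """Build SRT text from rows that have start, end, and a text column."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        # Remove any <text> / </text> wrapper echoed back by the model
        text = re.sub(r"</?text>", "", seg[text_key]).strip()
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{text}\n")
    return "\n".join(blocks)


# Made-up rows with the same column names as df_segments
rows = [
    {"start": 0.09, "end": 2.5, "segment_text_chinese": "<text> 你好 </text>"},
    {"start": 3661.5, "end": 3663.0, "segment_text_chinese": "谢谢"},
]
srt = to_srt(rows)
print(srt)
```

Passing `text_key="text"` instead would produce subtitles in the original language.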
Step 4: Embedding Segments for Later Retrieval
Our final stage is embeddings. If you're going through the trouble of transcription, you might as well make that content available as part of your knowledge base. Meeting notes might not be the most advanced AI use case anymore, but they still provide immense value for tracking decisions and key moments in discussions.
Adding an embeddings stage is as simple as calling embed_text():
```python
# Embed the segments
df_segments = (
    df_segments.with_column(
        "segment_embeddings",
        embed_text(
            col("text"),
            provider="transformers",
            model=EMBEDDING_MODEL_ID,
        ),
    )
)

print("\n\nGenerating Embeddings for Segments...")
```
```python
# Show the segments with embeddings
df_segments.select(
    "path",
    "text",
    "segment_embeddings",
).show(format="fancy", max_width=40)
```
```
╭────────────────────────────────────────┬────────────────────────────────────────┬───────────────────────────╮
│ path                                   ┆ text                                   ┆ segment_embeddings        │
╞════════════════════════════════════════╪════════════════════════════════════════╪═══════════════════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin. Let's talk batch infe…  ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Say you have a dataset of prompts th…  ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ With DAF's LLM generate function, th…  ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So let me just run the code first wh…  ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So here we're using DAF to read a CS…  ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ awesome chat GPT prompts data set.     ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Then we're using DAF's LLM generate …  ┆ …                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ specifying that we want to run it on…  ┆ …                         │
╰────────────────────────────────────────┴────────────────────────────────────────┴───────────────────────────╯
(Showing first 8 rows)
```
Excellent! Daft's native embedding DataType stores the embedding vectors for you, regardless of their dimensionality. Now you have:
- Transcript segments with timestamps
- Embeddings ready for semantic search
- Translations for multilingual support
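To show what "ready for semantic search" means in practice, here is a minimal, dependency-free sketch of ranking segments by cosine similarity. The three-dimensional vectors are tiny stand-ins for real sentence embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors), and `cosine_similarity` and `top_k` are helper names of ours rather than Daft functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], docs: List[Tuple[str, Sequence[float]]], k: int = 2):
    """Rank (text, embedding) pairs by similarity to the query embedding."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in docs]
    return sorted(scored, reverse=True)[:k]


# Tiny 3-dimensional vectors standing in for real sentence embeddings
docs = [
    ("batch inference", [1.0, 0.0, 0.0]),
    ("rock climbing", [0.0, 1.0, 0.0]),
    ("LLM prompts", [0.8, 0.0, 0.6]),
]
results = top_k([1.0, 0.0, 0.1], docs)
print([text for _, text in results])  # ['batch inference', 'LLM prompts']
```

In a real pipeline the query would be embedded with the same model as the segments, and the ranking would run inside the dataframe rather than in a Python loop.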
Summary
We've successfully built a complete Voice AI Analytics pipeline that:
- ✅ Ingests a directory of audio files
- ✅ Transcribes speech to text using Faster-Whisper with VAD
- ✅ Generates summaries from the transcripts
- ✅ Translates transcript segments to Chinese for subtitles
- ✅ Embeds transcriptions for future semantic search
Extensions and Next Steps
From here there are several directions you could take:
1. Q/A Chatbot
Leverage the embeddings to host a Q/A chatbot that enables listeners to engage with content across episodes:
- "What did Sam Harris say about free will in episode 267?"
- "Find all discussions about AI safety across my subscribed podcasts"
2. Recommendation Engine
Build recommendation engines that surface hidden gems based on semantic similarity rather than just metadata tags.
3. Dynamic Highlight Reels
Create dynamic highlight reels that auto-generate shareable clips based on sentiment spikes and topic density.
4. RAG Workflow
Leverage Daft's cosine_distance function to put together a full RAG (Retrieval-Augmented Generation) workflow for an interactive experience.
5. Analytics Dashboards
Use the same tooling to power analytics dashboards showcasing trending topics, or supply content for automated newsletters. Since everything you store is queryable and performant, the only limit is your imagination!
Key Takeaways
- Daft simplifies multimodal AI pipelines - Process, enrich, and query audio data with the same ease as tabular data
- Automatic parallelism - Maximum CPU and GPU utilization by default
- Lazy evaluation - Optimized query planning and efficient resource usage
- Easy extensibility - Adding new stages (summaries, translations, embeddings) is just another line of code
- No manual orchestration - No need to handle VAD, batching, or multiprocessing manually
Conclusion
At Eventual, we're simplifying multimodal AI so you don't have to. Managing voice AI pipelines or processing thousands of hours of podcast audio ultimately comes down to a few core needs:
- Transcripts so your content is accessible and searchable
- Summaries so your listeners can skim and find what matters
- Translations so you can localize your content to your audience
- Embeddings so people can ask questions like "Which episode talked about reinforcement learning?"
Traditionally, delivering all of this meant juggling multiple tools, data formats, and scaling headaches: a brittle setup that doesn't grow with your workload. With Daft, you get one unified engine to process, store, and query multimodal data efficiently.
Fewer moving parts means fewer failure points, less debugging, and a much shorter path from raw audio to usable insights.
Appendix: TranscriptionResult Definition
```python
from daft import DataType

WordStruct = DataType.struct(
    {
        "start": DataType.float64(),
        "end": DataType.float64(),
        "word": DataType.string(),
        "probability": DataType.float64(),
    }
)

SegmentStruct = DataType.struct(
    {
        "id": DataType.int64(),
        "seek": DataType.int64(),
        "start": DataType.float64(),
        "end": DataType.float64(),
        "text": DataType.string(),
        "tokens": DataType.list(DataType.int64()),
        "avg_logprob": DataType.float64(),
        "compression_ratio": DataType.float64(),
        "no_speech_prob": DataType.float64(),
        "words": DataType.list(WordStruct),
        "temperature": DataType.float64(),
    }
)

TranscriptionOptionsStruct = DataType.struct(
    {
        "beam_size": DataType.int64(),
        "best_of": DataType.int64(),
        "patience": DataType.float64(),
        "length_penalty": DataType.float64(),
        "repetition_penalty": DataType.float64(),
        "no_repeat_ngram_size": DataType.int64(),
        "log_prob_threshold": DataType.float64(),
        "no_speech_threshold": DataType.float64(),
        "compression_ratio_threshold": DataType.float64(),
        "condition_on_previous_text": DataType.bool(),
        "prompt_reset_on_temperature": DataType.float64(),
        "temperatures": DataType.list(DataType.float64()),
        "initial_prompt": DataType.python(),
        "prefix": DataType.string(),
        "suppress_blank": DataType.bool(),
        "suppress_tokens": DataType.list(DataType.int64()),
        "without_timestamps": DataType.bool(),
        "max_initial_timestamp": DataType.float64(),
        "word_timestamps": DataType.bool(),
        "prepend_punctuations": DataType.string(),
        "append_punctuations": DataType.string(),
        "multilingual": DataType.bool(),
        "max_new_tokens": DataType.float64(),
        "clip_timestamps": DataType.python(),
        "hallucination_silence_threshold": DataType.float64(),
        "hotwords": DataType.string(),
    }
)

VadOptionsStruct = DataType.struct(
    {
        "threshold": DataType.float64(),
        "neg_threshold": DataType.float64(),
        "min_speech_duration_ms": DataType.int64(),
        "max_speech_duration_s": DataType.float64(),
        "min_silence_duration_ms": DataType.int64(),
        "speech_pad_ms": DataType.int64(),
    }
)

LanguageProbStruct = DataType.struct(
    {
        "language": DataType.string(),
        "probability": DataType.float64(),
    }
)

InfoStruct = DataType.struct(
    {
        "language": DataType.string(),
        "language_probability": DataType.float64(),
        "duration": DataType.float64(),
        "duration_after_vad": DataType.float64(),
        "all_language_probs": DataType.list(LanguageProbStruct),
        "transcription_options": TranscriptionOptionsStruct,
        "vad_options": VadOptionsStruct,
    }
)

TranscriptionResult = DataType.struct(
    {
        "transcript": DataType.string(),
        "segments": DataType.list(SegmentStruct),
        "info": InfoStruct,
    }
)
```