File Types

The File DataType provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments.

File #

File(url: str, io_config: IOConfig | None = None, media_type: MediaType = unknown(), position: int | None = None, size: int | None = None, offset: int | None = None, length: int | None = None)

A file-like object for working with file contents in Daft.

This is an abstract base class that provides a standard file interface compatible with Python's file protocol.

The File object can be used with most Python libraries that accept file-like objects, and implements the standard read/seek/tell interface. Files are read-only in the current implementation.
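To illustrate what "the standard read/seek/tell interface" buys you, here is a minimal sketch of a read-only, seekable file-like object built on the standard library. `ReadOnlyBytes` is a hypothetical stand-in written for this example, not Daft's implementation; it only shows why a library such as `json` can consume any object that follows the protocol.

```python
import io
import json


class ReadOnlyBytes(io.RawIOBase):
    """Hypothetical read-only, seekable file object backed by bytes."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._data = data
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def writable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        # Copy as many bytes as fit into the caller's buffer.
        chunk = self._data[self._pos : self._pos + len(buffer)]
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: len(self._data)}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def tell(self) -> int:
        return self._pos


with ReadOnlyBytes(b'{"text": "hello"}') as f:
    record = json.load(f)      # json only needs .read()
    f.seek(0)                  # rewind to the start
    head = f.read(1)
    offset_after_read = f.tell()

print(record["text"], head, offset_after_read)
```

Because the object reports `writable()` as `False`, callers that check the protocol before writing will treat it as read-only, which matches the behavior described above.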

Examples:

>>> import daft
>>> from daft.functions import file
>>> df = daft.from_pydict({"paths": ["data.json"]})
>>> df = df.select(file(df["paths"]))
>>>
>>> @daft.func
... def read_json(file: daft.File) -> str:
...     import json
...     with file.open() as f:
...         data = json.load(f)
...         return data["text"]

Methods:

Name	Description
`as_audio`	Convert to AudioFile if this file contains audio data.
`as_hdf5`	Convert to Hdf5File if this file contains HDF5 data.
`as_image`	Convert to ImageFile if this file contains image data.
`as_video`	Convert to VideoFile if this file contains video data.
`exists`	Whether the file exists at its path or URL.
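`is_audio`	Whether the detected MIME type starts with `audio/`.
`is_hdf5`	Whether the detected MIME type is `application/vnd.hdfgroup.hdf5`.
`is_image`	Whether the detected MIME type starts with `image/`.
`is_video`	Whether the detected MIME type starts with `video/`.
`isatty`	Always returns `False`.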
`mime_type`	Attempts to determine the MIME type of the file.
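`open`	Open the file for reading and return a `PyDaftFile` handle.
`readable`	Always returns `True`.
`seekable`	Always returns `True`.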
`size`	The size of the file in bytes, derived from the underlying file.
`to_tempfile`	Create a temporary file with the contents of this file.
`writable`	Always returns `False`; files are read-only.

Attributes:

Name	Type	Description
`length`	`int | None`	Deprecated alias for the byte-range read window size, or None for full-file reads.
`name`	`str`	The filename (basename) extracted from the file path or URL.
`offset`	`int | None`	Deprecated alias for `position`. The byte offset for range reads, or None for full-file reads.
`path`	`str`	The full path or URL of the file.
`position`	`int | None`	The starting byte position for range reads, or None for full-file reads.

Source code in daft/file/file.py

def __init__(
    self,
    url: str,
    io_config: IOConfig | None = None,
    media_type: MediaType = MediaType.unknown(),
    position: int | None = None,
    size: int | None = None,
    offset: int | None = None,
    length: int | None = None,
) -> None:
    if offset is not None:
        warnings.warn(
            "`offset` is deprecated; use `position` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if position is None:
            position = offset

    if length is not None:
        warnings.warn(
            "`length` is deprecated; use `size` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if size is None:
            size = length

    self._inner = PyFileReference._from_tuple((media_type._media_type, url, io_config, position, size))  # type: ignore
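The alias handling in the constructor can be exercised in isolation. The sketch below copies that logic into a standalone helper, `resolve_range`, which is a hypothetical name used only for this illustration; it shows that the new argument wins when both are given and that every deprecated argument emits a `DeprecationWarning`.

```python
import warnings


def resolve_range(position=None, size=None, offset=None, length=None):
    """Return (position, size) after applying the deprecated aliases."""
    if offset is not None:
        warnings.warn(
            "`offset` is deprecated; use `position` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if position is None:  # the new name takes precedence
            position = offset
    if length is not None:
        warnings.warn(
            "`length` is deprecated; use `size` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if size is None:
            size = length
    return position, size


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_range(offset=128, length=64)   # both aliases used
    mixed = resolve_range(position=0, offset=128)   # explicit position wins

print(legacy, mixed, len(caught))
```

Note that `position=0` is kept even though `offset=128` is also passed, because the check is `position is None` rather than a truthiness test.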

length #

length: int | None

Deprecated alias for the byte-range read window size, or None for full-file reads.

Note: this returns the requested range size (caller intent), not the derived file size. Use File.size() for the actual file size.

name #

name: str

The filename (basename) extracted from the file path or URL.

Returns:

Name	Type	Description
`str`	`str`	The filename without directory components.

Example

>>> import daft
>>> f = daft.File("s3://bucket/path/to/data.csv")
>>> f.name
'data.csv'

offset #

offset: int | None

Deprecated alias for position. The byte offset for range reads, or None for full-file reads.

path #

path: str

The full path or URL of the file.

Returns:

Name	Type	Description
`str`	`str`	The file path or URL.

Example

>>> import daft
>>> f = daft.File("s3://bucket/path/to/data.csv")
>>> f.path
's3://bucket/path/to/data.csv'

position #

position: int | None

The starting byte position for range reads, or None for full-file reads.

as_audio #

as_audio() -> AudioFile

Convert to AudioFile if this file contains audio data.

Source code in daft/file/file.py

def as_audio(self) -> AudioFile:
    """Convert to AudioFile if this file contains audio data."""
    if not sf.module_available():
        raise ImportError(
            "The 'soundfile' module is required to convert files to audio. "
            "Please install it with: pip install 'daft[audio]'"
        )
    # this is purposely inside the function, and after the `sf` check
    # because using AudioFile means that the user has `sf` installed
    from daft.file.audio import AudioFile

    if not self.is_audio():
        raise ValueError(f"File {self} is not an audio file")

    cls = AudioFile.__new__(AudioFile)
    cls._inner = self._inner

    return cls

as_hdf5 #

as_hdf5() -> Hdf5File

Convert to Hdf5File if this file contains HDF5 data.

Source code in daft/file/file.py

def as_hdf5(self) -> Hdf5File:
    """Convert to Hdf5File if this file contains HDF5 data."""
    if not h5py.module_available():  # ty:ignore[unresolved-attribute]
        raise ImportError(
            "The 'h5py' module is required to convert files to HDF5. Please install it with: pip install 'h5py'"
        )
    from daft.file.hdf5 import Hdf5File

    if not self.is_hdf5():
        raise ValueError(f"File {self} is not an HDF5 file")

    cls = Hdf5File.__new__(Hdf5File)
    cls._inner = self._inner

    return cls

as_image #

as_image() -> ImageFile

Convert to ImageFile if this file contains image data.

Source code in daft/file/file.py

def as_image(self) -> ImageFile:
    """Convert to ImageFile if this file contains image data."""
    if not pil_image.module_available():
        raise ImportError(
            "The 'pillow' module is required to convert files to images. "
            "Please install it with: pip install 'daft[image]'"
        )
    from daft.file.image import ImageFile

    if not self.is_image():
        raise ValueError(f"File {self} is not an image file")

    cls = ImageFile.__new__(ImageFile)
    cls._inner = self._inner

    return cls

as_video #

as_video() -> VideoFile

Convert to VideoFile if this file contains video data.

Source code in daft/file/file.py

def as_video(self) -> VideoFile:
    """Convert to VideoFile if this file contains video data."""
    if not av.module_available():
        raise ImportError("The 'av' module is required to convert files to video.")
    # this is purposely inside the function, and after the `av` check
    # because using VideoFile means that the user has `av` installed
    from daft.file.video import VideoFile

    if not self.is_video():
        raise ValueError(f"File {self} is not a video file")

    cls = VideoFile.__new__(VideoFile)
    cls._inner = self._inner

    return cls

exists #

exists() -> bool

Whether the file exists at its path or URL.

Source code in daft/file/file.py

def exists(self) -> bool:
    """Whether the file exists at its path or URL."""
    return self._inner.exists()

is_audio #

is_audio() -> bool

Source code in daft/file/file.py

def is_audio(self) -> bool:
    mimetype = self.mime_type()
    if mimetype.startswith("audio/"):
        return True
    return False

is_hdf5 #

is_hdf5() -> bool

Source code in daft/file/file.py

def is_hdf5(self) -> bool:
    mimetype = self.mime_type()
    return mimetype == "application/vnd.hdfgroup.hdf5"

is_image #

is_image() -> bool

Source code in daft/file/file.py

def is_image(self) -> bool:
    mimetype = self.mime_type()
    if mimetype.startswith("image/"):
        return True
    return False

is_video #

is_video() -> bool

Source code in daft/file/file.py

def is_video(self) -> bool:
    mimetype = self.mime_type()
    if mimetype.startswith("video/"):
        return True
    return False
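The four `is_*` predicates all reduce to a check on the MIME type string. As a rough sketch, the same rules can be written as one standalone classifier; `media_kind` is a hypothetical helper for illustration and is not part of Daft.

```python
def media_kind(mime_type: str) -> str:
    """Classify a MIME type using the same rules as the is_* predicates."""
    if mime_type.startswith("audio/"):
        return "audio"
    if mime_type.startswith("image/"):
        return "image"
    if mime_type.startswith("video/"):
        return "video"
    # HDF5 has no top-level media type, so it is matched exactly.
    if mime_type == "application/vnd.hdfgroup.hdf5":
        return "hdf5"
    return "unknown"


for mime in ("audio/wav", "image/png", "video/mp4",
             "application/vnd.hdfgroup.hdf5", "application/octet-stream"):
    print(mime, "->", media_kind(mime))
```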

isatty #

isatty() -> bool

Source code in daft/file/file.py

def isatty(self) -> bool:
    return False

mime_type #

mime_type() -> str

Attempts to determine the MIME type of the file.

If the MIME type is undetectable, returns 'application/octet-stream'.

Source code in daft/file/file.py

def mime_type(self) -> str:
    """Attempts to determine the MIME type of the file.

    If the MIME type is undetectable, returns 'application/octet-stream'.
    """
    try:
        with self.open(buffer_size=BUFFER_SNIFF) as f:
            maybe_mime_type = f.guess_mime_type()
            return maybe_mime_type if maybe_mime_type else "application/octet-stream"
    except FileNotFoundError:
        if self.path.lower().endswith((".h5", ".hdf5")):
            return "application/vnd.hdfgroup.hdf5"
        maybe_mime_type, _ = mimetypes.guess_type(self.path)
        return maybe_mime_type if maybe_mime_type else "application/octet-stream"
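When the file cannot be opened, the method falls back to guessing from the path alone. The sketch below reproduces only that fallback branch with the standard library; `fallback_mime_type` is a hypothetical name, and the content-sniffing branch (`guess_mime_type` on the open handle) is deliberately left out.

```python
import mimetypes


def fallback_mime_type(path: str) -> str:
    """Guess a MIME type from the path alone, mirroring the except branch."""
    # HDF5 extensions are special-cased before consulting mimetypes.
    if path.lower().endswith((".h5", ".hdf5")):
        return "application/vnd.hdfgroup.hdf5"
    guessed, _ = mimetypes.guess_type(path)
    return guessed if guessed else "application/octet-stream"


print(fallback_mime_type("s3://bucket/scan.H5"))
print(fallback_mime_type("photo.png"))
print(fallback_mime_type("blob.qqqzzz"))
```

The extension check is case-insensitive because the path is lowercased first, so `scan.H5` is still recognized as HDF5.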

open #

open(buffer_size: int | None = None) -> PyDaftFile

Source code in daft/file/file.py

def open(self, buffer_size: int | None = None) -> PyDaftFile:
    if self.position is None and self._inner.size() is None and not self.exists():
        raise FileNotFoundError(f"File {self.path} does not exist")
    return PyDaftFile._from_file_reference(self._inner, buffer_size=buffer_size)

readable #

readable() -> bool

Source code in daft/file/file.py

def readable(self) -> bool:
    return True

seekable #

seekable() -> bool

Source code in daft/file/file.py

def seekable(self) -> bool:
    return True

size #

size() -> int

The size of the file in bytes, derived from the underlying file.

Source code in daft/file/file.py

def size(self) -> int:
    """The size of the file in bytes, derived from the underlying file."""
    return PyDaftFile._from_file_reference(self._inner, buffer_size=BUFFER_SNIFF).size()

to_tempfile #

to_tempfile(buffer_size: int = BUFFER_COPY) -> _TemporaryFileWrapper[bytes]

Create a temporary file with the contents of this file.

Returns:

Type	Description
`_TemporaryFileWrapper[bytes]`	The temporary file object.

The temporary file will be automatically deleted when the returned context manager is closed.

It's important to note that to_tempfile closes the original file object, so it CANNOT be used after calling this method.

Source code in daft/file/file.py

def to_tempfile(self, buffer_size: int = BUFFER_COPY) -> _TemporaryFileWrapper[bytes]:
    """Create a temporary file with the contents of this file.

    Returns:
        _TemporaryFileWrapper[bytes]: The temporary file object.

    The temporary file will be automatically deleted when the returned context manager is closed.

    It's important to note that `to_tempfile` closes the original file object, so it CANNOT be used after calling this method.
    """
    with self.open() as f:
        temp_file = tempfile.NamedTemporaryFile(
            prefix="daft_",
        )
        f.seek(0)

        size = f.size()
        # if its either a really small file, or doesn't support range requests. Just read it normally
        if not f._supports_range_requests() or size < 1024:
            temp_file.write(f.read())
        else:
            shutil.copyfileobj(f, temp_file, length=buffer_size)  # Default buffer size is 1MB
        # close it as `to_tempfile` is a consuming method
        f.close()
        temp_file.seek(0)

        return temp_file
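The core of this method is a buffered copy into a `NamedTemporaryFile`. The following sketch shows that pattern with an in-memory `io.BytesIO` standing in for the Daft file handle; `copy_to_tempfile` and the `sketch_` prefix are hypothetical, and the small-file and range-request branches are omitted.

```python
import io
import os
import shutil
import tempfile


def copy_to_tempfile(src, buffer_size: int = 1024 * 1024):
    """Copy a readable, seekable file object into a named temporary file."""
    temp_file = tempfile.NamedTemporaryFile(prefix="sketch_")
    src.seek(0)
    # Stream in buffer_size chunks instead of loading everything at once.
    shutil.copyfileobj(src, temp_file, length=buffer_size)
    temp_file.seek(0)  # rewind so the caller can read from the start
    return temp_file


payload = b"hello daft" * 1000
with copy_to_tempfile(io.BytesIO(payload)) as tmp:
    copied = tmp.read()
    tmp_name = tmp.name
    existed_while_open = os.path.exists(tmp_name)

exists_after_close = os.path.exists(tmp_name)
print(len(copied), existed_while_open, exists_after_close)
```

Leaving the `with` block closes the temporary file, and because `NamedTemporaryFile` deletes on close by default, the path no longer exists afterwards.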

writable #

writable() -> bool

Source code in daft/file/file.py

def writable(self) -> bool:
    return False

ImageFile #

ImageFile(url: str, io_config: IOConfig | None = None)

An image-specific file interface that provides image operations.

Methods:

Name	Description
`decode`	Decode the image file into a PIL Image.
`metadata`	Extract basic image metadata from file headers.

Source code in daft/file/image.py

def __init__(self, url: str, io_config: IOConfig | None = None) -> None:
    from daft.dependencies import pil_image

    if not pil_image.module_available():
        raise ImportError(
            "The 'pillow' module is required to create image files. "
            "Please install it with: pip install 'daft[image]'"
        )
    super().__init__(url, io_config, MediaType.image())

    if not self.is_image():
        raise ValueError(f"File {self} is not an image file")

decode #

decode(mode: str | None = None, buffer_size: int | None = BUFFER_COPY) -> Image

Decode the image file into a PIL Image.

Parameters:

Name	Type	Description	Default
`mode`	`str | None`	Optional image mode to convert to (e.g. "RGB", "RGBA", "L").	`None`
`buffer_size`	`int | None`	Read buffer size for full image decode. Defaults to 1 MiB.	`BUFFER_COPY`

Returns:

Type	Description
`Image`	The decoded image (`PIL.Image.Image`).

Source code in daft/file/image.py

def decode(self, mode: str | None = None, buffer_size: int | None = BUFFER_COPY) -> pil_image.Image:
    """Decode the image file into a PIL Image.

    Args:
        mode: Optional image mode to convert to (e.g. "RGB", "RGBA", "L").
        buffer_size: Read buffer size for full image decode. Defaults to 1 MiB.

    Returns:
        PIL.Image.Image: The decoded image.
    """
    with self.open(buffer_size=buffer_size) as f:
        img = pil_image.open(f)
        img.load()
        if mode is not None and img.mode != mode:
            img = img.convert(mode)
        return img

metadata #

metadata() -> ImageMetadata

Extract basic image metadata from file headers.

PIL's Image.open() is lazy -- it reads only the file header to determine dimensions, format, and mode without decoding pixel data.

Returns:

Name	Type	Description
`ImageMetadata`	`ImageMetadata`	Image metadata containing width, height, format, mode.

Source code in daft/file/image.py

def metadata(self) -> ImageMetadata:
    """Extract basic image metadata from file headers.

    PIL's Image.open() is lazy -- it reads only the file header to
    determine dimensions, format, and mode without decoding pixel data.

    Returns:
        ImageMetadata: Image metadata containing width, height, format, mode.
    """
    with self.open(buffer_size=BUFFER_METADATA) as f:
        img = pil_image.open(f)
        return ImageMetadata(
            width=img.width,
            height=img.height,
            format=img.format,
            mode=img.mode,
        )

AudioFile #

AudioFile(url: str, io_config: IOConfig | None = None)

An audio-specific file interface that provides audio operations.

Methods:

Name	Description
`metadata`	Extract basic audio metadata from container headers.
`resample`	Resample the audio file to the given sample rate.
`to_numpy`	Convert the audio file to a numpy array.

Source code in daft/file/audio.py

def __init__(self, url: str, io_config: IOConfig | None = None) -> None:
    if not sf.module_available():
        raise ImportError(
            "The 'soundfile' module is required to create audio files. "
            "Please add 'daft[audio]' to your dependencies or install it with: pip install 'daft[audio]'"
        )
    super().__init__(url, io_config, MediaType.audio())

    if not self.is_audio():
        raise ValueError(f"File {self} is not an audio file")

metadata #

metadata() -> AudioMetadata

Extract basic audio metadata from container headers.

Returns:

Name	Type	Description
`AudioMetadata`	`AudioMetadata`	Audio metadata object containing `sample_rate` (int, the sample rate), `channels` (int, the number of channels), `frames` (int, the number of frames), `format` (str, the format), and `subtype` (str | None, the subtype) of the audio file.

Source code in daft/file/audio.py

def metadata(self) -> AudioMetadata:
    """Extract basic audio metadata from container headers.

    Returns:
        AudioMetadata: Audio metadata object containing:
            - sample_rate: int - The sample rate of the audio file
            - channels: int - The number of channels in the audio file
            - frames: int - The number of frames in the audio file
            - format: str - The format of the audio file
            - subtype: str | None - The subtype of the audio file
    """
    with self.open(buffer_size=BUFFER_METADATA) as f, sf.SoundFile(f) as af:
        return AudioMetadata(
            sample_rate=af.samplerate,
            channels=af.channels,
            frames=af.frames,
            format=af.format,
            subtype=af.subtype,
        )

resample #

resample(sample_rate: int, buffer_size: int = BUFFER_COPY) -> ndarray[Any, dtype[float64]]

Resample the audio file to the given sample rate.

Parameters:

Name	Type	Description	Default
`sample_rate`	`int`	The new sample rate.	required
`buffer_size`	`int`	The buffer size to use for the temporary file.	`BUFFER_COPY`

Returns:

Name	Type	Description
`ndarray`	`ndarray[Any, dtype[float64]]`	The resampled audio data as a NumPy array. If the file is already at `sample_rate`, the data is returned unchanged.

Source code in daft/file/audio.py

def resample(self, sample_rate: int, buffer_size: int = BUFFER_COPY) -> np.ndarray[Any, np.dtype[np.float64]]:
    """Resample the audio file to the given sample rate.

    Args:
        sample_rate (int): The new sample rate.
        buffer_size (int): The buffer size to use for the temporary file.

    Returns:
        AudioFile: The resampled audio file.

    """
    if not librosa.module_available():
        raise ImportError(
            "The 'librosa' module is required to resample audio files. "
            "Please install it with: pip install 'daft[audio]'"
        )
    if not sf.module_available():
        raise ImportError(
            "The 'soundfile' module is required to resample audio files. "
            "Please install it with: pip install 'daft[audio]'"
        )
    with self.to_tempfile(buffer_size) as f:
        data, samplerate = sf.read(f)
        if samplerate != sample_rate:
            resampled_data = librosa.resample(data, orig_sr=samplerate, target_sr=sample_rate)
            return resampled_data
        else:
            return data

to_numpy #

to_numpy(buffer_size: int = BUFFER_COPY) -> ndarray[Any, dtype[float64]]

Convert the audio file to a numpy array.

Parameters:

Name	Type	Description	Default
`buffer_size`	`int`	The buffer size to use for the temporary file.	`BUFFER_COPY`

Returns:

Type	Description
`ndarray[Any, dtype[float64]]`	The audio data as a NumPy array.

Source code in daft/file/audio.py

def to_numpy(self, buffer_size: int = BUFFER_COPY) -> np.ndarray[Any, np.dtype[np.float64]]:
    """Convert the audio file to a numpy array.

    Args:
        buffer_size (int): The buffer size to use for the temporary file.

    Returns:
        np.ndarray[Any, Any]: The audio data as a numpy array.

    """
    with self.to_tempfile(buffer_size) as tmp:
        audio, _ = sf.read(tmp)
        return audio

VideoFile #

VideoFile(url: str, io_config: IOConfig | None = None)

A video-specific file interface that provides video operations.

Methods:

Name	Description
`frames`	Lazy iterator of all decoded frames with metadata within time range.
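`get_frame_by_idx`	Decode the frame at index `idx` and return it as a PIL Image; raises `IndexError` if the index is out of range.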
`keyframes`	Lazy iterator of keyframes as PIL Images within time range.
`metadata`	Extract basic video metadata from container headers.

Source code in daft/file/video.py

def __init__(self, url: str, io_config: IOConfig | None = None) -> None:
    if not av.module_available():
        raise ImportError("The 'av' module is required to create video files.")
    if not pil_image.module_available():
        raise ImportError(
            "The 'pillow' module is required to create video files. Install it with `pip install daft[video]`."
        )
    super().__init__(url, io_config, MediaType.video())

    if not self.is_video():
        raise ValueError(f"File {self} is not a video file")

frames #

frames(start_time: float = 0, end_time: float | None = None, width: int | None = None, height: int | None = None, is_key_frame: bool | None = None, sample_interval_seconds: float | None = None, buffer_size: int = BUFFER_COPY) -> Iterator[VideoFrameData]

Lazy iterator of all decoded frames with metadata within time range.

Mirrors the per-frame schema of daft.read_video_frames().

Parameters:

Name	Type	Description	Default
`start_time`	`float`	Start of the time range in seconds. Defaults to 0.	`0`
`end_time`	`float | None`	End of the time range in seconds. Defaults to None (end of video).	`None`
`width`	`int | None`	Optional target width for resizing frames. Must be provided with `height`.	`None`
`height`	`int | None`	Optional target height for resizing frames. Must be provided with `width`.	`None`
`is_key_frame`	`bool | None`	If True, emit only keyframes. If False, emit only non-keyframes. If None, emit all decoded frames.	`None`
`sample_interval_seconds`	`float | None`	If provided and > 0, sample frames at approximately this time interval in seconds based on `frame_time`. The algorithm picks the first frame whose timestamp is >= the next target time (`start_time`, `start_time + interval`, `start_time + 2*interval`, ...). Frames without valid timestamps are skipped. Same semantics as the source-side `daft.read_video_frames`.	`None`
`buffer_size`	`int`	Read buffer size used when opening the file.	`BUFFER_COPY`

Yields:

Type	Description
`VideoFrameData`	VideoFrameData dicts with keys: frame_index, frame_time, frame_time_base, frame_pts, frame_dts, frame_duration, is_key_frame, data (PIL Image).

Source code in daft/file/video.py

def frames(
    self,
    start_time: float = 0,
    end_time: float | None = None,
    width: int | None = None,
    height: int | None = None,
    is_key_frame: bool | None = None,
    sample_interval_seconds: float | None = None,
    buffer_size: int = BUFFER_COPY,
) -> Iterator[VideoFrameData]:
    """Lazy iterator of all decoded frames with metadata within time range.

    Mirrors the per-frame schema of ``daft.read_video_frames()``.

    Args:
        start_time: Start of the time range in seconds. Defaults to 0.
        end_time: End of the time range in seconds. Defaults to None (end of video).
        width: Optional target width for resizing frames. Must be provided with ``height``.
        height: Optional target height for resizing frames. Must be provided with ``width``.
        is_key_frame: If True, emit only keyframes. If False, emit only non-keyframes.
            If None, emit all decoded frames.
        sample_interval_seconds: If provided and > 0, sample frames at approximately
            this time interval in seconds based on ``frame_time``. The algorithm picks
            the first frame whose timestamp is >= the next target time
            (``start_time``, ``start_time + interval``, ``start_time + 2*interval``, ...).
            Frames without valid timestamps are skipped. Same semantics as the
            source-side ``daft.read_video_frames``.

    Yields:
        VideoFrameData dicts with keys: frame_index, frame_time, frame_time_base,
        frame_pts, frame_dts, frame_duration, is_key_frame, data (PIL Image).
    """
    if not pil_image.module_available():
        raise ImportError(
            "The 'pillow' module is required for frame decoding. Install it with `pip install daft[video]`."
        )
    if (width is None) != (height is None):
        raise ValueError("Both width and height must be specified together for resizing.")
    if sample_interval_seconds is not None and sample_interval_seconds <= 0:
        raise ValueError("sample_interval_seconds must be positive if provided")
    with self.open(buffer_size=buffer_size) as f, av.open(f) as container:
        video = next(
            (stream for stream in container.streams if stream.type == "video"),
            None,
        )
        if video is None:
            raise ValueError("No video stream found")

        if is_key_frame:
            video.codec_context.skip_frame = "NONKEY"

        # Seek to start time
        if start_time > 0 and video.time_base:
            seek_timestamp = int(start_time / float(video.time_base))
            container.seek(seek_timestamp, stream=video)

        time_base = float(video.time_base) if video.time_base else None
        fps = float(video.average_rate) if video.average_rate else None
        if fps is None and video.guessed_rate:
            fps = float(video.guessed_rate)
        start_pts = video.start_time or 0

        # Sampling targets are start_time, start_time + interval, ... — same algorithm
        # as `daft.read_video_frames`. Epsilon absorbs float-precision drift between
        # the target time and the frame's PTS-derived `frame.time`.
        next_sample_time: float | None = float(start_time) if sample_interval_seconds is not None else None
        epsilon: float = 1e-9 if sample_interval_seconds is None else max(1e-9, sample_interval_seconds * 1e-6)

        frame_index: int = 0
        for frame in container.decode(video):
            # Skip frames before start_time (seek may land earlier)
            if frame.time is not None and frame.time < start_time:
                frame_index += 1
                continue

            # Stop at end_time
            if end_time is not None:
                if frame.time is not None and frame.time > end_time:
                    break

            if is_key_frame is False and frame.key_frame:
                frame_index += 1
                continue

            # Time-interval sampling: emit only when the frame has reached the next target.
            if sample_interval_seconds is not None:
                if frame.time is None:
                    frame_index += 1
                    continue
                assert next_sample_time is not None
                if frame.time + epsilon < next_sample_time:
                    frame_index += 1
                    continue
                # Advance past every target this frame already covers, so a long gap
                # between frames (VFR / large interval) doesn't queue up extra emits.
                while next_sample_time is not None and frame.time + epsilon >= next_sample_time:
                    next_sample_time += sample_interval_seconds

            # Resize if requested
            output_frame = frame
            if width is not None and height is not None:
                output_frame = frame.reformat(width=width, height=height)

            current_frame_index = frame_index
            if frame.pts is not None and time_base is not None and fps is not None:
                current_frame_index = int(round((frame.pts - start_pts) * time_base * fps))

            yield VideoFrameData(
                frame_index=current_frame_index,
                frame_time=frame.time,
                frame_time_base=str(frame.time_base) if frame.time_base else None,
                frame_pts=frame.pts,
                frame_dts=frame.dts,
                frame_duration=frame.duration,
                is_key_frame=frame.key_frame,
                data=output_frame.to_image(),
            )

            frame_index += 1
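The time-interval sampling in the loop above can be hard to follow inside the decoding code, so here is a simplified sketch that applies the same rule to a plain list of timestamps. `sample_frame_times` is a hypothetical helper for illustration; it ignores `end_time`, keyframe filtering, and resizing.

```python
def sample_frame_times(frame_times, start_time: float, interval: float):
    """Pick the first frame at or after each target time.

    Targets are start_time, start_time + interval, start_time + 2*interval, ...
    """
    next_target = float(start_time)
    # Tolerance for float drift between the target and the frame timestamp.
    epsilon = max(1e-9, interval * 1e-6)
    picked = []
    for t in frame_times:
        if t is None or t < start_time:
            continue  # no valid timestamp, or before the requested range
        if t + epsilon < next_target:
            continue  # target not reached yet
        # Skip every target this frame already covers, so a long gap
        # between frames does not produce a burst of extra picks.
        while t + epsilon >= next_target:
            next_target += interval
        picked.append(t)
    return picked


regular = sample_frame_times([0.0, 0.4, 0.9, 1.0, 1.1, 2.5, 2.6, 3.0], 0.0, 1.0)
gappy = sample_frame_times([0.0, 5.0, 5.1], 0.0, 1.0)
print(regular)  # [0.0, 1.0, 2.5, 3.0]
print(gappy)    # [0.0, 5.0]
```

In the second call the frame at 5.0 covers the targets 1.0 through 5.0 at once, so the next target becomes 6.0 and the frame at 5.1 is not emitted.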

get_frame_by_idx #

get_frame_by_idx(idx: int, buffer_size: int = BUFFER_COPY) -> Image

Source code in daft/file/video.py

def get_frame_by_idx(self, idx: int, buffer_size: int = BUFFER_COPY) -> PIL.Image.Image:
    if not pil_image.module_available():
        raise ImportError(
            "The 'pillow' module is required for frame decoding. Install it with `pip install daft[video]`."
        )
    if idx < 0:
        raise IndexError(f"Frame index {idx} is out of range")

    with self.open(buffer_size=buffer_size) as f, av.open(f) as container:
        video = next(
            (stream for stream in container.streams if stream.type == "video"),
            None,
        )
        if video is None:
            raise ValueError("No video stream found")

        time_base = float(video.time_base) if video.time_base else None
        fps = float(video.average_rate) if video.average_rate else None
        if fps is None and video.guessed_rate:
            fps = float(video.guessed_rate)
        start_pts = video.start_time or 0

        # Seek to the nearest preceding keyframe at or before the target frame.
        if idx > 0 and time_base is not None and fps is not None:
            target_time = idx / fps
            seek_timestamp = int(target_time / time_base)
            container.seek(seek_timestamp, stream=video, backward=True)

        for frame_idx, frame in enumerate(container.decode(video)):
            current_frame_index = frame_idx
            if frame.pts is not None and time_base is not None and fps is not None:
                current_frame_index = int(round((frame.pts - start_pts) * time_base * fps))

            if current_frame_index == idx:
                return frame.to_image()
            if current_frame_index > idx:
                break

        raise IndexError(f"Frame index {idx} is out of range")
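Both `frames` and `get_frame_by_idx` derive a frame index from the presentation timestamp (PTS) with the same formula. The sketch below isolates that arithmetic; the 90 kHz time base and the 30000/1001 frame rate are assumed example values, and `frame_index_from_pts` is a hypothetical helper name.

```python
from fractions import Fraction


def frame_index_from_pts(pts: int, start_pts: int, time_base: float, fps: float) -> int:
    """Convert a PTS (in time_base units) into a zero-based frame index."""
    return int(round((pts - start_pts) * time_base * fps))


# Assumed example stream: 90 kHz time base, NTSC frame rate (~29.97 fps),
# so consecutive frames are 3003 ticks apart.
time_base = float(Fraction(1, 90000))
fps = float(Fraction(30000, 1001))

indices = [frame_index_from_pts(pts, 0, time_base, fps) for pts in (0, 3003, 6006, 15015)]
shifted = frame_index_from_pts(3903, 900, time_base, fps)  # stream starting at PTS 900
print(indices, shifted)
```

Rounding before the `int` conversion matters here: the float product for an exact frame boundary can land slightly below the integer, and plain truncation would then report the previous frame.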

keyframes #

keyframes(start_time: float = 0, end_time: float | None = None) -> Iterator[Image]

Lazy iterator of keyframes as PIL Images within time range.

Source code in daft/file/video.py

def keyframes(self, start_time: float = 0, end_time: float | None = None) -> Iterator[PIL.Image.Image]:
    """Lazy iterator of keyframes as PIL Images within time range."""
    for frame in self.frames(start_time=start_time, end_time=end_time, is_key_frame=True):
        yield frame["data"]

metadata #

metadata(buffer_size: int = BUFFER_METADATA) -> VideoMetadata

Extract basic video metadata from container headers.

Returns:

Name	Type	Description
`VideoMetadata`	`VideoMetadata`	Video metadata object; this method populates width, height, fps, duration, frame_count, and time_base.

Source code in daft/file/video.py

def metadata(self, buffer_size: int = BUFFER_METADATA) -> VideoMetadata:
    """Extract basic video metadata from container headers.

    Returns:
        VideoMetadata: Video metadata object containing width, height, fps, frame_count, time_base, keyframe_pts, keyframe_indices

    """
    with self.open(buffer_size=buffer_size) as f, av.open(f, mode="r", metadata_encoding="utf-8") as container:
        video = next(
            (stream for stream in container.streams if stream.type == "video"),
            None,
        )
        if video is None:
            return VideoMetadata(
                width=None,
                height=None,
                fps=None,
                duration=None,
                frame_count=None,
                time_base=None,
            )

        # Basic stream properties ----------
        width = video.width
        height = video.height
        time_base = float(video.time_base) if video.time_base else None

        # Frame rate -----------------------
        fps = None
        if video.average_rate:
            fps = float(video.average_rate)
        elif video.guessed_rate:
            fps = float(video.guessed_rate)

        # Duration -------------------------
        duration = None
        if container.duration and container.duration > 0:
            duration = container.duration / 1_000_000.0
        elif video.duration:
            # Fallback time_base only for duration computation if missing
            tb_for_dur = float(video.time_base) if video.time_base else (1.0 / 1_000_000.0)
            duration = float(video.duration * tb_for_dur)

        # Frame count -----------------------
        frame_count = video.frames
        if not frame_count or frame_count <= 0:
            if duration and fps:
                frame_count = int(round(duration * fps))
            else:
                frame_count = None

        return VideoMetadata(
            width=width,
            height=height,
            fps=fps,
            duration=duration,
            frame_count=frame_count,
            time_base=time_base,
        )
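The frame-count fallback is the least obvious part of this method: the stream's own count is used when it is positive, otherwise it is estimated from duration and frame rate. A standalone sketch of that decision, using the hypothetical helper name `estimate_frame_count`:

```python
def estimate_frame_count(stream_frames, duration, fps):
    """Return the stream's frame count, or estimate it as duration * fps."""
    if stream_frames and stream_frames > 0:
        return stream_frames  # trust the container when it reports a count
    if duration and fps:
        return int(round(duration * fps))
    return None  # not enough information to estimate


reported = estimate_frame_count(250, 10.0, 30.0)
estimated = estimate_frame_count(0, 10.0, 29.97)
unknown = estimate_frame_count(0, None, 30.0)
print(reported, estimated, unknown)
```

An estimated count is only as accurate as the reported duration and average frame rate, so it can differ from the number of frames actually decoded, especially for variable-frame-rate video.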

Hdf5File #

Hdf5File(url: str, io_config: IOConfig | None = None)

Represents an HDF5 file backed by Daft file IO.

This class keeps File.open() as the inherited raw byte-stream API and provides HDF5-specific helpers that mirror common h5py File and Group operations. HDF5 access uses a smaller default file buffer than the generic File type because h5py performs frequent small reads after seeks while traversing metadata and chunk indexes.

Methods:

Name	Description
`attrs`	Return attributes attached to an HDF5 object.
`keys`	Return member names directly under an HDF5 group.
`metadata`	Collect object metadata below an HDF5 group.
`open`	Open the raw byte stream, using the HDF5-specific default buffer size.
`read`	Read one or more HDF5 datasets into NumPy arrays.
`visit`	Recursively visit object names below an HDF5 group.

Source code in daft/file/hdf5.py

def __init__(self, url: str, io_config: IOConfig | None = None) -> None:
    if not h5py.module_available():  # ty:ignore[unresolved-attribute]
        raise ImportError(
            "The 'daft[hdf5]' extra is required to read HDF5 files. "
            "Please install it with: pip install 'daft[hdf5]'"
        )
    if not np.module_available():  # type: ignore[attr-defined]  # ty:ignore[unresolved-attribute]
        raise ImportError(
            "The 'numpy' module is required to read HDF5 files and is included in the 'daft[hdf5]' extra. "
            "Please install it with: pip install 'daft[hdf5]'"
        )
    super().__init__(url, io_config, MediaType.hdf5())

    if not self.is_hdf5():
        raise ValueError(f"File {self} is not an HDF5 file")

attrs #

attrs(h5path: str = '/') -> dict[str, Any]

Return attributes attached to an HDF5 object.

Mirrors h5py's <object>.attrs dictionary-style interface and materializes the attributes as a plain Python dictionary.

Parameters:

Name	Type	Description	Default
`h5path`	`str`	Group or dataset path. Defaults to the root group `/`.	`'/'`

Returns:

Type	Description
`dict[str, Any]`	A dictionary of attribute names to values. Values follow h5py's normal conversion rules, such as NumPy scalars or arrays.

Source code in daft/file/hdf5.py

def attrs(self, h5path: str = "/") -> dict[str, Any]:
    """Return attributes attached to an HDF5 object.

    Mirrors h5py's ``<object>.attrs`` dictionary-style interface and
    materializes the attributes as a plain Python dictionary.

    Args:
        h5path: Group or dataset path. Defaults to the root group ``/``.

    Returns:
        A dictionary of attribute names to values. Values follow h5py's
        normal conversion rules, such as NumPy scalars or arrays.
    """
    with self._open_h5py(HDF5_SCAN_BUFFER_SIZE) as h5:
        return dict(h5[h5path].attrs)

keys #

keys(group: str = '/') -> list[str]

Return member names directly under an HDF5 group.

Mirrors h5py Group.keys(), but returns a concrete list[str] instead of a view object.

Parameters:

Name	Type	Description	Default
`group`	`str`	Group path whose immediate members should be listed. Defaults to the root group `/`.	`'/'`

Returns:

Type	Description
`list[str]`	Names of child groups and datasets directly under `group`.
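The difference between a view and a concrete list is easiest to see with a plain dictionary, used below as a stand-in for an open h5py group (an analogy only; no HDF5 file is involved, and the member names are made up). A view stays tied to its container, while a list is a snapshot that remains usable on its own. That matters here because, per the source listing, the file is opened inside a `with` block and is closed again before the method returns.

```python
# A plain dict standing in for an HDF5 group and its members.
group = {"images": object(), "labels": object()}

view = group.keys()            # live view, tied to the container
snapshot = list(group.keys())  # independent copy of the names

# Changing the container afterwards affects the view but not the snapshot.
group["masks"] = object()

print(list(view))   # ['images', 'labels', 'masks']
print(snapshot)     # ['images', 'labels']
```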

Source code in daft/file/hdf5.py

def keys(self, group: str = "/") -> list[str]:
    """Return member names directly under an HDF5 group.

    Mirrors h5py ``Group.keys()``, but returns a concrete ``list[str]``
    instead of a view object.

    Args:
        group: Group path whose immediate members should be listed.
            Defaults to the root group ``/``.

    Returns:
        Names of child groups and datasets directly under ``group``.
    """
    with self._open_h5py(HDF5_SCAN_BUFFER_SIZE) as h5:
        node = h5[group]
        return list(node.keys())

metadata #

metadata(group: str = '/') -> list[Hdf5ObjectMetadata]

Collect object metadata below an HDF5 group.

This is a Daft convenience around the same recursive traversal used by `h5py.File.visititems()`. It visits the groups and datasets under `group` and returns one dictionary per object, each with the same fixed set of keys, so the result loads cleanly into a DataFrame.

Parameters:

Name	Type	Description	Default
`group`	`str`	Group path to traverse. Defaults to the root group `/`.	`'/'`

Returns:

Type	Description
`list[Hdf5ObjectMetadata]`	A list of metadata dictionaries containing `h5path`, `kind`, `shape`, `dtype`, `chunks`, and `compression`.

Raises:

Type	Description
`TypeError`	If `group` resolves to a dataset instead of a group.
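Because every record carries the same six keys, the result can be filtered or turned into columns without checking for missing fields. The sketch below uses hand-written records in the documented shape rather than a real file: `Hdf5ObjectMetadataSketch` is a hypothetical stand-in for `Hdf5ObjectMetadata`, and the paths and values are invented for illustration. Groups use the empty placeholders (`[]` and `""`) seen in the source listing.

```python
from __future__ import annotations

from typing import TypedDict


class Hdf5ObjectMetadataSketch(TypedDict):
    """Hypothetical mirror of the documented metadata record."""

    h5path: str
    kind: str
    shape: list[int]
    dtype: str
    chunks: list[int]
    compression: str


# Invented records in the shape that metadata() is documented to return.
records: list[Hdf5ObjectMetadataSketch] = [
    {"h5path": "/train", "kind": "group", "shape": [], "dtype": "",
     "chunks": [], "compression": ""},
    {"h5path": "/train/images", "kind": "dataset", "shape": [100, 28, 28],
     "dtype": "uint8", "chunks": [10, 28, 28], "compression": "gzip"},
    {"h5path": "/train/labels", "kind": "dataset", "shape": [100],
     "dtype": "int64", "chunks": [], "compression": ""},
]

# Keep only datasets, then only the compressed ones.
dataset_paths = [r["h5path"] for r in records if r["kind"] == "dataset"]
compressed_paths = [r["h5path"] for r in records if r["compression"]]

# Pivot the records into column-oriented form, one list per key.
columns = {key: [r[key] for r in records] for key in records[0]}

print(dataset_paths)      # ['/train/images', '/train/labels']
print(compressed_paths)   # ['/train/images']
print(columns["kind"])    # ['group', 'dataset', 'dataset']
```

The column-oriented dictionary has the form accepted by `daft.from_pydict`, although whether the nested list columns need explicit types is not covered here.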

Source code in daft/file/hdf5.py

def metadata(self, group: str = "/") -> list[Hdf5ObjectMetadata]:
    """Collect object metadata below an HDF5 group.

    This is a Daft convenience around the same recursive traversal used by
    ``h5py.File.visititems()``. It visits groups and datasets under
    ``group`` and returns DataFrame-friendly dictionaries with stable keys.

    Args:
        group: Group path to traverse. Defaults to the root group ``/``.

    Returns:
        A list of metadata dictionaries containing ``h5path``, ``kind``,
        ``shape``, ``dtype``, ``chunks``, and ``compression``.

    Raises:
        TypeError: If ``group`` resolves to a dataset instead of a group.
    """
    with self._open_h5py(HDF5_SCAN_BUFFER_SIZE) as h5:
        node = h5[group]
        if not hasattr(node, "visititems"):
            raise TypeError(f"{group} is not an HDF5 group")

        objects: list[Hdf5ObjectMetadata] = []

        def collect(name: str, obj: Any) -> None:
            if hasattr(obj, "shape") and hasattr(obj, "dtype"):
                objects.append(
                    Hdf5ObjectMetadata(
                        h5path=_join_h5path(group, name),
                        kind="dataset",
                        shape=list(obj.shape),
                        dtype=str(obj.dtype),
                        chunks=list(obj.chunks) if obj.chunks is not None else [],
                        compression=obj.compression or "",
                    )
                )
            else:
                objects.append(
                    Hdf5ObjectMetadata(
                        h5path=_join_h5path(group, name),
                        kind="group",
                        shape=[],
                        dtype="",
                        chunks=[],
                        compression="",
                    )
                )

        node.visititems(collect)
        return objects

open #

open(buffer_size: int | None = HDF5_DEFAULT_BUFFER_SIZE) -> PyDaftFile

Open the file as a raw byte stream. This is the inherited File.open() with one difference: buffer_size defaults to HDF5_DEFAULT_BUFFER_SIZE.

Source code in daft/file/hdf5.py

def open(self, buffer_size: int | None = HDF5_DEFAULT_BUFFER_SIZE) -> PyDaftFile:
    return super().open(buffer_size=buffer_size)

read #

read(dataset: str) -> ndarray[Any, Any]

read(dataset: list[str] | tuple[str, ...]) -> dict[str, ndarray[Any, Any]]

read(dataset: str | Sequence[str]) -> ndarray[Any, Any] | dict[str, ndarray[Any, Any]]

Read one or more HDF5 datasets into NumPy arrays.

For a single dataset path, this is equivalent to opening the file with h5py and evaluating `h5[dataset][()]`. Passing a sequence of paths reads all of them while opening the file only once.

Parameters:

Name	Type	Description	Default
`dataset`	`str \| Sequence[str]`	A dataset path or sequence of dataset paths.	required

Returns:

Type	Description
`ndarray[Any, Any] \| dict[str, ndarray[Any, Any]]`	A NumPy array for one dataset. For multiple datasets, a dictionary keyed by dataset path.

Raises:

Type	Description
`TypeError`	If any requested path resolves to a group instead of a dataset, or if `dataset` is a mapping.
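The return type depends on the argument type, and the order of the checks matters because a `str` is itself a sequence of characters. The sketch below models that dispatch over an in-memory dictionary instead of an HDF5 file: `read_from` and `store` are hypothetical names, and plain lists stand in for NumPy arrays.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence
from typing import Any


def read_from(store: Mapping[str, Any], dataset: str | Sequence[str]) -> Any:
    """Return one value for a single path, or a dict keyed by path for several."""
    if isinstance(dataset, Mapping):
        # Mappings are rejected up front, as in the source listing.
        raise TypeError("pass a dataset path or a sequence of paths, not a mapping")
    if isinstance(dataset, str):
        # Checked before iterating: a str would otherwise be read char by char.
        return store[dataset]
    return {path: store[path] for path in dataset}


store = {"/x": [1, 2, 3], "/y": [4, 5, 6]}

print(read_from(store, "/x"))           # [1, 2, 3]
print(read_from(store, ["/x", "/y"]))   # {'/x': [1, 2, 3], '/y': [4, 5, 6]}
```

A one-element sequence such as `["/x"]` still returns a dictionary, so callers can rely on the result type following the argument type.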

Source code in daft/file/hdf5.py

def read(
    self,
    dataset: str | Sequence[str],
) -> np.ndarray[Any, Any] | dict[str, np.ndarray[Any, Any]]:
    """Read one or more HDF5 datasets into NumPy arrays.

    For a single dataset path, this is equivalent to opening the file with
    h5py and evaluating ``h5[dataset][()]``. Passing a sequence reads
    multiple datasets with one file open.

    Args:
        dataset: A dataset path or sequence of dataset paths.

    Returns:
        A NumPy array for one dataset. For multiple datasets, a dictionary
        keyed by dataset path.

    Raises:
        TypeError: If any requested path resolves to a group instead of a
            dataset.
    """
    if isinstance(dataset, Mapping):
        raise TypeError("Hdf5File.read() does not support alias mappings; pass a dataset path or sequence.")

    with self._open_h5py() as h5:
        if isinstance(dataset, str):
            return self._read_dataset(h5, dataset)
        return {h5path: self._read_dataset(h5, h5path) for h5path in dataset}

visit #

visit(*, group: str = '/') -> list[str]

visit(func: Callable[[str], Any], *, group: str = '/') -> Any

visit(func: Callable[[str], Any] | None = None, *, group: str = '/') -> Any

Recursively visit object names below an HDF5 group.

Thin wrapper around h5py `Group.visit`. When `func` is provided, it is called once per visited object name. Returning `None` from `func` continues traversal; returning any other value stops traversal and returns that value.

If `func` is omitted, this method collects and returns all visited names as a list. This matches the common h5py pattern of passing `names.append` as the visitor.

Parameters:

Name	Type	Description	Default
`func`	`Callable[[str], Any] \| None`	Optional visitor callable with signature `func(name)`.	`None`
`group`	`str`	Group path where traversal should start. Defaults to `/`.	`'/'`

Returns:

Type	Description
`Any`	The visitor's first non-`None` return value, or `None` if the visitor completed without one. If `func` is omitted, returns `list[str]`.
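The visitor protocol can be modelled without an HDF5 file. The sketch below walks nested dictionaries that stand in for groups, applying the same rule: a `None` return continues the traversal, and any other value stops it and is returned. The `visit` function here is a hypothetical pure-Python model, not the h5py implementation, and the sorted iteration order is an assumption made so the output is deterministic.

```python
from __future__ import annotations

from typing import Any, Callable


def visit(tree: dict[str, Any], func: Callable[[str], Any], prefix: str = "") -> Any:
    """Call func on each member name; stop at the first non-None result."""
    for name in sorted(tree):
        path = f"{prefix}{name}"
        result = func(path)
        if result is not None:
            return result
        child = tree[name]
        if isinstance(child, dict):  # nested dicts model subgroups
            result = visit(child, func, prefix=f"{path}/")
            if result is not None:
                return result
    return None


# Nested dicts model groups; any other value models a dataset.
tree = {"train": {"images": 0, "labels": 0}, "test": {"images": 0}}

# Collecting pattern: list.append returns None, so traversal never stops early.
names: list[str] = []
visit(tree, names.append)
print(names)  # ['test', 'test/images', 'train', 'train/images', 'train/labels']

# Early-exit pattern: return the first name that ends with "labels".
first = visit(tree, lambda name: name if name.endswith("labels") else None)
print(first)  # train/labels
```

The collecting pattern is what the method does when `func` is omitted, which is why it can return a list in that case.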

Source code in daft/file/hdf5.py

def visit(self, func: Callable[[str], Any] | None = None, *, group: str = "/") -> Any:
    """Recursively visit object names below an HDF5 group.

    Thin wrapper around h5py ``Group.visit``. When ``func`` is provided, it
    is called once per visited object name. Returning ``None`` from
    ``func`` continues traversal; returning any other value stops traversal
    and returns that value.

    If ``func`` is omitted, this method collects and returns all visited
    names as a list. This matches the common h5py pattern of passing
    ``names.append`` as the visitor.

    Args:
        func: Optional visitor callable with signature ``func(name)``.
        group: Group path where traversal should start. Defaults to ``/``.

    Returns:
        The visitor's first non-``None`` return value, or ``None`` if the
        visitor completed without one. If ``func`` is omitted, returns
        ``list[str]``.
    """
    with self._open_h5py(HDF5_SCAN_BUFFER_SIZE) as h5:
        node = h5[group]
        if func is None:
            names: list[str] = []
            node.visit(names.append)
            return names
        return node.visit(func)