Reading from and Writing to Lance#
Lance is a next-generation columnar storage format for multimodal datasets (images, video, audio, and general columnar data). It supports local POSIX filesystems and cloud object stores (e.g., S3/GCS). Lance is known for extremely fast random access, zero-copy reads, deep integration with PyArrow/DuckDB, and strong performance for vector retrieval workloads.
Daft currently supports:
-
Parallel and distributed reads: Daft parallelizes reads on the default multithreading runner or on the distributed Ray runner
-
Skipping filtered data (Data Skipping): Daft uses
df.where()predicates and file/fragment statistics to skip non-matching data -
Multi-cloud and local access: Read from S3, GCS, Azure Blob Storage, and local filesystems with unified IO configuration
-
Version/time-slice reads: Use
versionandasofparameters to read a specific version or the latest version as of a given timestamp -
Scan optimization: Configure
fragment_group_sizeto group fragments and improve scan efficiency -
Cache tuning: Configure
index_cache_sizeandmetadata_cache_size_bytesto optimize index page caching and metadata retrieval for large datasets -
REST API support: Connect to Lance tables managed by REST-compliant services like LanceDB Cloud, Apache Gravitino, and other catalog systems using the Lance REST Namespace specification
Installing Daft with Lance Support#
Daft integrates Lance through an optional dependency:
1 | |
For REST API support, you'll also need the Lance namespace packages:
1 | |
Reading a Table#
Use daft.read_lance to read a Lance dataset. You can pass either a local path, a cloud object store URI, or a REST endpoint.
File-based Reading#
1 2 3 4 5 6 7 8 | |
REST-based Reading#
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | |
To access public S3/GCS buckets, configure IO options for authentication and endpoints:
1 2 3 4 5 6 7 8 9 10 11 | |
For S3-compatible services (e.g. Volcengine TOS), configure IO options for authentication and endpoints:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
For datasets with many fragments, group fragments to reduce scheduling and metadata overhead:
1 2 | |
Filter operations on the Daft df DataFrame object will be pushed down to the Lance data source for efficient data skipping.
1 2 3 4 5 | |
Filtering with Custom SQL Expressions#
Daft supports passing raw SQL filter strings directly to the Lance scanner via default_scan_options. This allows you to leverage Lance's native SQL capabilities, including GeoSpatial functions, which might not be directly expressible in Daft's DataFrame API yet.
You can also use this mechanism to perform SQL-based projections (calculations) during the scan.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | |
Writing to Lance#
Use df.write_lance() to write a DataFrame to a Lance dataset. Supported modes include create, append, and overwrite. You can write to both file-based and REST-based Lance tables.
File-based Writing#
1 2 3 4 5 6 7 8 9 10 11 12 | |
REST-based Writing#
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | |
For S3-compatible services (e.g. Volcengine TOS), configure IO options for authentication and endpoints:#
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | |
Note: the {region} should match the region of your TOS bucket. e.g. if your bucket is in cn-beijing, you should set region_name="cn-beijing" and endpoint_url="https://tos-s3-cn-beijing.ivolces.com".
Writing with a specific schema#
You can provide a pyarrow.Schema to control the on-disk Lance schema. Daft will align the DataFrame to this schema (column order, types, and nullability) before writing:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | |
This ensures that the resulting Lance table uses the exact schema you specify, even if the in-memory DataFrame has compatible but different types or column ordering.
Write Schema Control
- If
schemais not provided, Daft uses the current DataFrame schema. - If a
pyarrow.Schemais provided, data will be aligned to that schema before writing (type/order/nullability). - If the target dataset already exists and the write is not an overwrite, data is converted to the existing table schema for compatibility.
REST Configuration and Catalog Management#
When working with Lance tables via REST APIs, you can configure authentication and manage table catalogs:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | |
Supported REST Services#
The Lance REST implementation is compatible with any service that implements the Lance REST Namespace specification, including:
- LanceDB Cloud: Managed Lance service with enterprise features
- LanceDB Enterprise: Self-hosted enterprise deployment
- Apache Gravitino: Multi-modal metadata service with Lance support
- Custom implementations: Any service implementing the Lance REST Namespace OpenAPI spec
Advanced Usage#
Vector Search#
Daft can push Lance's vector search options through default_scan_options. This lets you express nearest-neighbor queries at scan time while keeping the rest of your pipeline in Daft.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | |
Data Evolution#
If you need to add derived columns in-place to an existing Lance dataset (e.g., apply a UDF across batches and persist the result), use daft.io.lance.merge_columns:
1 2 3 4 5 6 7 8 9 10 11 12 13 | |
Compaction#
Compaction is the process of rewriting a Lance dataset to optimize its structure for query performance. This operation can:
- Merge small files: Combine multiple small data fragments into fewer, larger fragments to reduce metadata overhead and improve read throughput.
- Materialize deletions: Physically remove rows that have been marked for deletion, which can reduce storage and accelerate scans.
- Improve data layout: Reorganize data to improve compression and read efficiency.
Use compaction when a dataset has undergone many small appends, has a high number of deleted rows, or contains a large number of small files.
Daft provides a distributed implementation of compaction through [daft.io.lance.compact_files][].
1 2 3 4 5 6 7 8 9 10 11 | |
uri: Path to the Lance dataset.compaction_options: A dictionary of options to control compaction behavior (e.g.,target_rows_per_fragment,materialize_deletions), see Lance documentation for more details.- Returns: A
CompactionMetricsobject with statistics if compaction was performed, orNoneif no action was needed. TheCompactionMetricsobject contains the following fields:fragments_removed: Number of fragments removed during compaction.fragments_added: Number of fragments added during compaction.files_removed: Number of files removed during compaction.files_added: Number of files added during compaction.
partition_num: On the Ray Runner, this controls the number of parallel compaction tasks. Defaults to 1. On the native runner, this option is ignored.
This example compacts a dataset with multiple fragments into a single, larger fragment, and uses partition_num to control the number of parallel compaction tasks.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | |