Reading from and Writing to Apache Paimon
Apache Paimon is an open-source lakehouse storage format designed for high-throughput streaming and batch analytics. It supports ACID transactions, primary-key tables with upserts (via an LSM-tree merge engine), append-only tables, and flexible partition strategies, which makes it a popular choice on top of storage systems such as HDFS, OSS, and S3.
Daft integrates with Paimon through pypaimon, the official Apache Paimon Python SDK.
Daft currently supports:
- Distributed Reads: Daft distributes I/O across all available compute resources (Ray workers or local threads).
- Predicate & Partition Pushdown: `df.where()` filter expressions are pushed down to Paimon's scan planner for partition pruning and file-level skipping.
- Column Projection: Only the requested columns are read from disk.
- Append-only and Primary-Key Tables: Both table types are supported; append-only tables use Daft's native high-performance Parquet reader, while PK tables that require LSM merge fall back to pypaimon's built-in reader.
- Catalog Abstraction: Paimon catalogs integrate with Daft's unified `Catalog`/`Table` interfaces, enabling SQL queries and `daft.read_table()` access.
Installation
Install Daft and pypaimon with pip, for example `pip install daft pypaimon`.
For S3 or S3-compatible storage (e.g. MinIO, Ceph), also install Daft's AWS extra, for example `pip install "daft[aws]"`.
OSS (Alibaba Cloud Object Storage) is supported via Daft's built-in OpenDAL backend; no extra is required beyond daft and pypaimon.
Tutorial
Reading a Table
Use `daft.read_paimon` to create a DataFrame from a Paimon table. First, obtain a table object through pypaimon, then pass it to `daft.read_paimon` to create a DataFrame.
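A minimal sketch of the two steps above. It assumes a recent pure-Python pypaimon release that exposes `CatalogFactory.create()` and `catalog.get_table()`; the helper name `read_paimon_table` is illustrative only and is not part of either library.

```python
def read_paimon_table(warehouse: str, identifier: str):
    """Open a Paimon table and return it as a lazy Daft DataFrame.

    `identifier` uses pypaimon's "database.table" form.
    """
    # Imported inside the function so this module loads without the packages.
    import daft
    from pypaimon import CatalogFactory  # assumption: pure-Python pypaimon API

    # Step 1: obtain a table object through pypaimon.
    catalog = CatalogFactory.create({"warehouse": warehouse})
    table = catalog.get_table(identifier)

    # Step 2: hand the table to Daft, which builds a lazy DataFrame over it.
    return daft.read_paimon(table)
```

With a hypothetical local warehouse, usage would look like `read_paimon_table("/tmp/paimon_warehouse", "my_db.my_table").show()`.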
Filter operations are pushed down to Paimon's scan planner for efficient partition pruning.
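As a rough illustration of a filter that can be pushed down, the helper below applies an equality predicate to a DataFrame obtained from `daft.read_paimon`. The helper name and the partition column used in the usage note are hypothetical.

```python
def filter_equals(df, column: str, value):
    """Keep only the rows of a Daft DataFrame where `column == value`."""
    # df[column] yields a column expression; where() adds the predicate to the
    # lazy plan, which is what allows it to be pushed down to the Paimon scan.
    return df.where(df[column] == value)
```

For example, `filter_equals(daft.read_paimon(table), "dt", "2024-01-01")` would restrict the scan to a single value of a hypothetical `dt` partition column.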
For tables on object stores, pass an IOConfig to supply credentials.
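One way this might look for S3 is sketched below. It assumes `daft.read_paimon` accepts the `IOConfig` through an `io_config` keyword (as Daft's other readers do) and that credentials live in the conventional AWS environment variables; both helper names are illustrative.

```python
import os
from typing import Mapping, Optional


def s3_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect S3 settings from conventional AWS environment variables."""
    return {
        "key_id": env.get("AWS_ACCESS_KEY_ID"),
        "access_key": env.get("AWS_SECRET_ACCESS_KEY"),
        "region_name": env.get("AWS_REGION"),
        # Only needed for S3-compatible stores such as MinIO or Ceph.
        "endpoint_url": env.get("AWS_ENDPOINT_URL"),
    }


def read_from_s3(table, env: Mapping[str, str] = os.environ):
    """Read a pypaimon table whose files live on S3."""
    import daft
    from daft.io import IOConfig, S3Config

    io_config = IOConfig(s3=S3Config(**s3_settings(env)))
    # Assumption: the keyword is named `io_config`.
    return daft.read_paimon(table, io_config=io_config)
```

Keeping the settings in a plain dict makes it easy to check what will be sent to `S3Config` before any I/O happens.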
Writing to a Table
Use `df.write_paimon()` to write a DataFrame back to Paimon. Two modes are supported: append and overwrite.
Append
The returned DataFrame summarises the written files.
Overwrite
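Both write modes can be sketched with one small wrapper. This assumes `df.write_paimon()` takes the pypaimon table as its first argument and selects the mode through a `mode` keyword with the values `"append"` and `"overwrite"`; check the signature in your Daft version, since that keyword is an assumption here.

```python
SUPPORTED_MODES = ("append", "overwrite")


def write_to_paimon(df, table, mode: str = "append"):
    """Write a Daft DataFrame to a pypaimon table in one of the two modes."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"mode must be one of {SUPPORTED_MODES}, got {mode!r}")
    # Assumption: write_paimon(table, mode=...) is the call shape.
    # The return value is the DataFrame summarising the written files.
    return df.write_paimon(table, mode=mode)
```

Validating the mode up front turns a typo into an immediate `ValueError` instead of a failure partway through a distributed write.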
Using the Catalog Abstraction
Daft's `Catalog` and `Table` interfaces let you access Paimon tables through the same API used by Iceberg, Unity, Glue, and other built-in integrations.
Creating a Catalog
Wrap a pypaimon catalog with `Catalog.from_paimon()` to obtain a Daft catalog.
You can also wrap a single pypaimon table directly.
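The two ways of wrapping described in this section might look like the sketch below. `Catalog.from_paimon()` is named elsewhere in this guide; `Table.from_paimon()` is an assumption made by analogy with Daft's other table wrappers, and `CatalogFactory.create()` assumes a recent pure-Python pypaimon. The helper names are illustrative.

```python
def catalog_options(warehouse: str, metastore: str = "filesystem", **extra) -> dict:
    """Build a pypaimon catalog options dict.

    `extra` carries any further options, for example whatever additional
    settings a REST catalog requires (not shown here).
    """
    return {"warehouse": warehouse, "metastore": metastore, **extra}


def open_daft_catalog(warehouse: str, metastore: str = "filesystem", **extra):
    """Create a pypaimon catalog and wrap it as a Daft Catalog."""
    from daft.catalog import Catalog
    from pypaimon import CatalogFactory  # assumption: pure-Python pypaimon API

    paimon_catalog = CatalogFactory.create(catalog_options(warehouse, metastore, **extra))
    return Catalog.from_paimon(paimon_catalog)


def wrap_single_table(paimon_table):
    """Wrap one pypaimon table as a Daft Table."""
    from daft.catalog import Table

    # Assumption: Table.from_paimon exists and takes the pypaimon table.
    return Table.from_paimon(paimon_table)
```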
Browsing the Catalog
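A small sketch of browsing, using the `list_namespaces()`, `list_tables()`, and `has_table()` methods of Daft's `Catalog` interface. The helper name and the identifier in the usage note are hypothetical.

```python
def summarize_catalog(catalog, table_identifier: str) -> dict:
    """Collect a quick overview of a Daft catalog."""
    return {
        # Identifiers are converted to strings for easy printing.
        "namespaces": [str(ns) for ns in catalog.list_namespaces()],
        "tables": [str(t) for t in catalog.list_tables()],
        # True if the given "namespace.table" identifier resolves.
        "has_table": catalog.has_table(table_identifier),
    }
```

Calling `summarize_catalog(catalog, "my_db.my_table")` on a catalog created with `Catalog.from_paimon()` would return a dict that can be printed or logged.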
Reading and Writing Through the Catalog
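Reading and writing through the catalog could be sketched as follows, using `catalog.get_table()` together with the `read()`, `append()`, and `overwrite()` methods of Daft's `Table` interface. The helper name and the table identifiers in the usage note are hypothetical.

```python
def copy_rows(catalog, source: str, target: str, overwrite: bool = False) -> None:
    """Copy all rows from one catalog table into another."""
    # read() returns a lazy DataFrame over the source table.
    df = catalog.get_table(source).read()
    dest = catalog.get_table(target)
    if overwrite:
        dest.overwrite(df)  # replace the target's existing contents
    else:
        dest.append(df)  # add rows to the target
```

For example, `copy_rows(catalog, "my_db.events", "my_db.events_backup")` would append the rows of one table to another with a compatible schema.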
Creating Tables
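Table creation through the catalog might be sketched like this. It relies on the `create_namespace_if_not_exists()`, `has_table()`, `create_table()`, and `get_table()` methods of Daft's `Catalog` interface; `source` is either a Daft `Schema` or a DataFrame whose schema is used. The helper name is illustrative, and how a given `properties` dict is interpreted by the Paimon catalog is not shown here.

```python
def ensure_table(catalog, namespace: str, name: str, source, properties=None):
    """Create `namespace.name` if it is missing, then return the table."""
    catalog.create_namespace_if_not_exists(namespace)
    identifier = f"{namespace}.{name}"
    if not catalog.has_table(identifier):
        # `source` supplies the schema; `properties` is passed through as-is.
        catalog.create_table(identifier, source, properties=properties)
    return catalog.get_table(identifier)
```

Because the helper checks `has_table()` first, calling it twice with the same identifier leaves the existing table untouched.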
To create a primary-key table, pass `primary_keys` in the properties dict.
Session and SQL Integration
Once attached to a Session, your Paimon catalog is available from SQL queries and `daft.read_table()`.
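A minimal sketch of the attach-then-query flow, using `Session.attach_catalog()` and `Session.sql()` from `daft.session`. The alias `paimon`, the helper name, and the table name in the usage note are hypothetical.

```python
def query_paimon(catalog, sql: str, alias: str = "paimon"):
    """Attach a Daft catalog to a fresh session and run one SQL query."""
    from daft.session import Session

    sess = Session()
    # After attaching, tables are addressable in SQL under the chosen alias.
    sess.attach_catalog(catalog, alias=alias)
    return sess.sql(sql)
```

A call such as `query_paimon(catalog, "SELECT * FROM paimon.my_db.my_table LIMIT 10")` assumes the usual `alias.namespace.table` form for identifiers. Using an explicit `Session` keeps the attachment local rather than modifying global state.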
Type System
Paimon types are mapped through PyArrow to Daft types:
| Paimon | Daft |
|---|---|
| `BOOLEAN` | `daft.DataType.bool()` |
| `TINYINT` | `daft.DataType.int8()` |
| `SMALLINT` | `daft.DataType.int16()` |
| `INT` | `daft.DataType.int32()` |
| `BIGINT` | `daft.DataType.int64()` |
| `FLOAT` | `daft.DataType.float32()` |
| `DOUBLE` | `daft.DataType.float64()` |
| `DECIMAL(precision, scale)` | `daft.DataType.decimal128(precision, scale)` |
| `DATE` | `daft.DataType.date()` |
| `TIME(precision)` | `daft.DataType.int64()` |
| `TIMESTAMP(precision)` | `daft.DataType.timestamp(timeunit=..., timezone=None)` |
| `TIMESTAMP_LTZ(precision)` | `daft.DataType.timestamp(timeunit=..., timezone="UTC")` |
| `CHAR(n)` / `VARCHAR(n)` / `STRING` | `daft.DataType.string()` |
| `BINARY(n)` / `VARBINARY(n)` / `BYTES` | `daft.DataType.binary()` |
| `ARRAY<element_type>` | `daft.DataType.list(element_type)` |
| `MAP<key_type, value_type>` | `daft.DataType.map(key_type, value_type)` |
| `ROW<[field_name: field_type]>` | `daft.DataType.struct(fields)` |
FAQs
- Do I need to install pypaimon separately? Yes. Run `pip install pypaimon`. pypaimon is not bundled with Daft.
- Which Paimon catalog types are supported? pypaimon currently ships two catalog implementations: `FileSystemCatalog` (local / OSS / S3, selected by `metastore=filesystem` or by default) and `RESTCatalog` (selected by `metastore=rest`). Both can be wrapped with `Catalog.from_paimon()`.
- Can I use Daft with an existing Paimon warehouse? Yes. Create a pypaimon catalog pointing at your warehouse and pass it to `Catalog.from_paimon()`, or use `catalog.get_table()` and `daft.read_paimon()` directly.
- Does Daft support schema evolution for Paimon tables? Reading tables with evolved schemas is handled by pypaimon's reader. Daft does not currently expose DDL operations beyond `create_table`.
- How do I configure credentials for OSS or S3? Pass an `IOConfig` to `daft.read_paimon()`, or include the relevant `fs.*` options in the catalog options dict when creating the pypaimon catalog (Daft will infer an `IOConfig` automatically from these).
- Can I use the Daft Paimon connector inside a Ray cluster? Yes. Daft's distributed execution on Ray works with Paimon the same way it does with Iceberg: scan tasks are distributed across workers automatically.