Reading from Apache Hudi#
Apache Hudi is an open-source transactional data lake platform that brings database and data warehouse capabilities to data lakes. Hudi supports transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping your data in open source file formats.
Daft currently supports:
- Parallel + Distributed Reads: Daft parallelizes Hudi table reads over all cores of your machine, if using the default multithreading runner, or all cores + machines of your Ray cluster, if using the distributed Ray runner.
- Skipping Filtered Data: Daft ensures that only data that matches your df.where() filter will be read, often skipping entire files/partitions.
- Multi-cloud Support: Daft supports reading Hudi tables from AWS S3, Azure Blob Store, and GCS, as well as local files.
A detailed Apache Hudi roadmap for Daft can be found on our GitHub Issues page. For the overall Daft development plan, see Daft Roadmap.
Installing Daft with Apache Hudi Support#
Daft supports Hudi through an optional dependency, installed as the hudi extra of the Daft package.
Reading a Table#
To read from an Apache Hudi table, use the daft.read_hudi() function, which takes the table URI and returns a Daft DataFrame.
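As a minimal sketch of such a read, the helper below loads a table and applies a filter. The helper name read_filtered_hudi, the URI "some-table-uri", and the column "foo" are placeholders introduced for illustration, not part of Daft, and the sketch assumes Daft is installed with Hudi support.

```python
def read_filtered_hudi(table_uri, column, threshold):
    """Lazily read a Hudi table and keep rows where `column` > `threshold`.

    Hypothetical helper for illustration; only daft.read_hudi, daft.col,
    and DataFrame.where are Daft APIs.
    """
    # Imported inside the function so this module still loads in
    # environments where Daft is not installed.
    import daft

    df = daft.read_hudi(table_uri)
    # The filter becomes part of the lazy query plan, so files or
    # partitions that cannot match may be skipped at read time.
    return df.where(daft.col(column) > threshold)


# Usage with placeholder values:
# df = read_filtered_hudi("some-table-uri", "foo", 5)
# df.show()
```

Nothing is read until an action such as df.show() runs, which is what lets the filter limit the data that is actually loaded.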
Reading Hudi tables currently has the following limitations:
- Only snapshot reads of Copy-on-Write tables are supported
- Only table versions 5 and 6 are supported (tables created using release 0.12.x - 0.15.x)
- The table must not have hoodie.datasource.write.drop.partition.columns=true
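The limitations above could be checked before attempting a read. The sketch below is not part of Daft: the helper names are hypothetical, and it assumes the table's settings are available as the text of its hoodie.properties file (normally found under the table's .hoodie directory) and that the keys hoodie.table.type and hoodie.table.version are present there.

```python
# Assumption: these key names match what Hudi writes to hoodie.properties.
SUPPORTED_TABLE_VERSIONS = {"5", "6"}
DROP_PARTITION_KEY = "hoodie.datasource.write.drop.partition.columns"


def parse_properties(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def find_read_blockers(props):
    """Return the reasons a table falls outside the listed limitations."""
    problems = []
    if props.get("hoodie.table.type") != "COPY_ON_WRITE":
        problems.append("table is not Copy-on-Write")
    if props.get("hoodie.table.version") not in SUPPORTED_TABLE_VERSIONS:
        problems.append("table version is not 5 or 6")
    if props.get(DROP_PARTITION_KEY, "false").lower() == "true":
        problems.append("partition columns were dropped on write")
    return problems
```

An empty list means none of the three limitations applies, as far as the properties file can tell.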
Type System#
Daft and Hudi have compatible type systems. Here is how types are converted across the two systems.
When reading from a Hudi table into Daft:
| Apache Hudi | Daft |
|---|---|
| Primitive Types | |
| boolean | daft.DataType.bool() |
| byte | daft.DataType.int8() |
| short | daft.DataType.int16() |
| int | daft.DataType.int32() |
| long | daft.DataType.int64() |
| float | daft.DataType.float32() |
| double | daft.DataType.float64() |
| decimal(precision, scale) | daft.DataType.decimal128(precision, scale) |
| date | daft.DataType.date() |
| timestamp | daft.DataType.timestamp(timeunit="us", timezone=None) |
| timestampz | daft.DataType.timestamp(timeunit="us", timezone="UTC") |
| string | daft.DataType.string() |
| binary | daft.DataType.binary() |
| Nested Types | |
| struct(fields) | daft.DataType.struct(fields) |
| list(child_type) | daft.DataType.list(child_type) |
| map(K, V) | daft.DataType.struct({"key": K, "value": V}) |