Connectors#
Daft can read from and write to many kinds of data sources: in-memory data, files, data catalogs, and integrations. See the Daft Connectors API docs for API details.
In-Memory#
| Function | Description |
|---|---|
| from_arrow | Create a DataFrame from PyArrow Tables or RecordBatches |
| from_dask_dataframe | Create a DataFrame from a Dask DataFrame |
| from_pandas | Create a DataFrame from a Pandas DataFrame |
| from_pydict | Create a DataFrame from a Python dictionary |
| from_pylist | Create a DataFrame from a Python list |
| from_ray_dataset | Create a DataFrame from a Ray Dataset |
Files#
| Function | Description |
|---|---|
| from_files | Create a DataFrame of lazy file references from a glob pattern |
| from_glob_path | Create a DataFrame of file paths from a glob pattern |
See also Files for detailed usage.
Cloud Storage#
Daft natively supports reading and writing data to major cloud storage providers:
| Provider | URL Protocols | Configuration |
|---|---|---|
| AWS S3 | s3:// | S3Config |
| Azure Blob Storage | az://, abfs:// | AzureConfig |
| Google Cloud Storage | gs://, gcs:// | GCSConfig |
| Tencent Cloud COS | cos://, cosn:// | CosConfig |
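To make the table concrete, here is a small standard-library sketch that looks up which configuration applies to a given URL based on its protocol. The helper `config_name_for` and the `PROTOCOL_TO_CONFIG` mapping are hypothetical names introduced for illustration; they are not part of Daft, and the mapping simply restates the table above.

```python
from urllib.parse import urlparse

# Mapping copied from the table above: URL protocol -> configuration name.
PROTOCOL_TO_CONFIG = {
    "s3": "S3Config",
    "az": "AzureConfig",
    "abfs": "AzureConfig",
    "gs": "GCSConfig",
    "gcs": "GCSConfig",
    "cos": "CosConfig",
    "cosn": "CosConfig",
}


def config_name_for(url: str) -> str:
    """Hypothetical helper: name the configuration that applies to a URL."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in PROTOCOL_TO_CONFIG:
        raise ValueError(f"no cloud storage configuration listed for {url!r}")
    return PROTOCOL_TO_CONFIG[scheme]
```

For example, `config_name_for("s3://my-bucket/data.parquet")` resolves to the S3 configuration, while a local path with no protocol raises a `ValueError`.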
CSV#
| Function | Description |
|---|---|
| read_csv | Read a CSV file or multiple CSV files into a DataFrame |
| write_csv | Write a DataFrame to CSV files |
Delta Lake#
| Function | Description |
|---|---|
| read_deltalake | Read a Delta Lake table into a DataFrame |
| write_deltalake | Write a DataFrame to a Delta Lake table |
See also Delta Lake for detailed integration.
Hudi#
| Function | Description |
|---|---|
| read_hudi | Read a Hudi table into a DataFrame |
See also Apache Hudi for detailed integration.
Iceberg#
| Function | Description |
|---|---|
| read_iceberg | Read an Iceberg table into a DataFrame |
| write_iceberg | Write a DataFrame to an Iceberg table |
See also Iceberg for detailed integration.
Paimon#
| Function | Description |
|---|---|
| read_paimon | Read a Paimon table into a DataFrame |
| write_paimon | Write a DataFrame to a Paimon table |
See also Apache Paimon for detailed integration.
JSON#
| Function | Description |
|---|---|
| read_json | Read a JSON file or multiple JSON files into a DataFrame |
| write_json | Write a DataFrame to JSON files |
Kafka#
Experimental
This connector is experimental. Currently only bounded batch reads are supported — there is no streaming/unbounded mode and no offset commit management.
| Function | Description |
|---|---|
| read_kafka | Read messages from Kafka topic(s) into a DataFrame |
See also Kafka for detailed integration.
Lance#
| Function | Description |
|---|---|
| read_lance | Read a Lance dataset into a DataFrame |
| write_lance | Write a DataFrame to a Lance dataset |
See also Lance for detailed integration.
Parquet#
| Function | Description |
|---|---|
| read_parquet | Read a Parquet file or multiple Parquet files into a DataFrame |
| write_parquet | Write a DataFrame to Parquet files |
PostgreSQL#
| Function | Description |
|---|---|
| Catalog.from_postgres | Create a catalog from a PostgreSQL database |
See also PostgreSQL for detailed integration.
MCAP#
Experimental
This connector is experimental. See MCAP for details.
| Function | Description |
|---|---|
| read_mcap | Read MCAP files into a DataFrame |
See also MCAP for detailed integration.
SQL#
| Function | Description |
|---|---|
| read_sql | Read data from a SQL database into a DataFrame |
| write_sql | Write a DataFrame to a SQL database |
Text#
| Function | Description |
|---|---|
| read_text | Read text files into a DataFrame |
See also Text for detailed usage.
Video#
| Function | Description |
|---|---|
| read_video_frames | Read video frames into a DataFrame |
WARC#
| Function | Description |
|---|---|
| read_warc | Read a WARC file or multiple WARC files into a DataFrame |
Bigtable#
Experimental
This connector is experimental and the API may change.
| Function | Description |
|---|---|
| write_bigtable | Write a DataFrame to Google Cloud Bigtable |
See also Bigtable for detailed integration.
ClickHouse#
| Function | Description |
|---|---|
| write_clickhouse | Write a DataFrame to ClickHouse |
See also ClickHouse for detailed integration.
User-Defined#
| Function | Description |
|---|---|
| DataSink | Interface for writing data from DataFrames |
| DataSource | Interface for reading data into DataFrames |
| DataSourceTask | Represents a partition of data that can be processed independently |
| WriteResult | Wrapper for intermediate results written by a DataSink |
| write_sink | Write a DataFrame to the given DataSink |
Daft Catalogs#
Warning
These APIs are early in their development. Please feel free to open feature requests and file issues. We'd love to hear what you would like, thank you! 🤘
Daft also provides APIs for working with catalogs. A catalog is a centralized place to organize and govern your data. It is typically responsible for creating objects such as tables and namespaces, managing transactions, and enforcing access control. Most importantly, a catalog abstracts away physical storage details, letting you focus on the logical structure of your data without worrying about file formats, partitioning schemes, or storage locations.
Daft integrates with various catalog implementations using its Catalog and Table interfaces. These are high-level APIs to manage catalog objects (tables and namespaces), while also making it easy to leverage Daft's existing daft.read_ and df.write_ APIs for open table formats like Iceberg and Delta Lake.
Example#
Note
These examples use the Iceberg Catalog from the Daft Sessions tutorial.
Usage#
This section covers detailed usage of the current APIs with some code snippets.
Working with Catalogs#
The Catalog interface allows you to perform catalog actions like get_table and list_tables.
Example
Working with Tables#
The Table interface is a bridge between catalogs and dataframes: you can read tables into dataframes and write dataframes to tables. You can also work with a table independently of a catalog by using one of the factory methods. This may not seem to offer much over the existing daft.read_ and df.write_ APIs, and indeed those APIs are what run under the hood. What the Table interface adds is indirection over the table format itself, giving catalogs a single abstraction for reading and writing.
Examples
Note
Today you can read from pyiceberg and daft.unity table objects.
Reference#
Note
For complete documentation, please see the Catalog & Table API docs.
- Catalog - Interface for creating and accessing both tables and namespaces
- Identifier - Paths to objects, e.g. catalog.namespace.table
- Table - Interface for reading and writing dataframes