Hugging Face Datasets#

Daft has native support for reading from and writing to Hugging Face datasets.

To install all dependencies required for Daft's Hugging Face integrations, use the huggingface feature:

1	`pip install 'daft[huggingface]'`

Reading From a Dataset#

Daft is able to read datasets directly from Hugging Face using the daft.read_huggingface() function or via the hf://datasets/ protocol.

Reading an Entire Dataset#

Using daft.read_huggingface(), you can easily read a Hugging Face dataset.

🐍 Python

import daft

df = daft.read_huggingface("username/dataset_name")

This will read the entire dataset into a DataFrame.

Warning

This is currently limited to either public datasets, or PRO/ENTERPRISE datasets, where Hugging Face will automatically convert the dataset to Parquet.

For other datasets, you will need to manually specify the path or glob pattern to the files you want to read, similar to how you would read from a local file system. See the next section for an example.

Reading Specific Files#

Not only can you read entire datasets, but you can also read individual files from a dataset. Using a read function that takes in a path (such as daft.read_parquet(), daft.read_csv(), or daft.read_json()), specify a Hugging Face dataset path via the hf://datasets/ prefix:

🐍 Python

import daft

# read a specific Parquet file
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")

# or a csv file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")

# or a set of Parquet files using a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")

Writing to a Dataset#

Daft is able to write Parquet files to Hugging Face datasets using DataFrame.write_huggingface. Daft supports Content-Defined Chunking and Xet for faster, deduplicated writes.

Basic usage:

🐍 Python

import daft

df: daft.DataFrame = ...

df.write_huggingface("username/dataset_name")

See the DataFrame.write_huggingface API page for more info.

Configuring Writes#

DataFrame.write_huggingface accepts an IOConfig which can be used to configure the write behavior. Here's an example of how to use it:

🐍 Python

import daft
from daft.io import IOConfig, HuggingFaceConfig

df: daft.DataFrame = ...

io_config = IOConfig(hf=HuggingFaceConfig(
    use_content_defined_chunking=...,
    row_group_size=...,
    target_filesize=...,
    max_operations_per_commit=...,
))

df.write_huggingface("username/dataset_name", io_config=io_config)

See the HuggingFaceConfig API page for more information about each argument.

Authentication#

The token parameter in HuggingFaceConfig can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private datasets or writing to a dataset).

Example of reading a dataset with a specified token:

🐍 Python

from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))
df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config)