Hugging Face Datasets
Daft has native support for reading from and writing to Hugging Face datasets.
To install all dependencies required for Daft's Hugging Face integrations, install Daft with the huggingface feature, for example: pip install "daft[huggingface]"
Reading From a Dataset
Daft can read datasets directly from Hugging Face, either with the daft.read_huggingface() function or through the hf://datasets/ protocol.
Reading an Entire Dataset
Using daft.read_huggingface(), you can easily read a Hugging Face dataset.
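As a minimal sketch, the helper below wraps that call. The name read_entire_dataset is hypothetical and introduced here only for illustration, and a repository id such as username/dataset_name is a placeholder. daft is imported inside the function so that defining the helper requires neither the package nor network access.

```python
def read_entire_dataset(repo: str):
    """Read a whole Hugging Face dataset into a Daft DataFrame.

    `repo` is a repository id such as "username/dataset_name" (a placeholder).
    """
    # Imported lazily so that defining this helper needs neither daft nor network access.
    import daft

    return daft.read_huggingface(repo)
```

Calling read_entire_dataset("username/dataset_name").show() would then display the first rows, provided the repository exists and is readable.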
This will read the entire dataset into a DataFrame.
Warning
daft.read_huggingface() currently works only for datasets that Hugging Face automatically converts to Parquet, namely public datasets and PRO/Enterprise datasets.
For other datasets, you will need to manually specify the path or glob pattern to the files you want to read, similar to how you would read from a local file system. See the next section for an example.
Reading Specific Files
You can also read individual files from a dataset. Pass a Hugging Face dataset path with the hf://datasets/ prefix to any read function that takes a path, such as daft.read_parquet(), daft.read_csv(), or daft.read_json():
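The sketch below shows one way to build such a path and pass it to daft.read_parquet(). The helper names hf_dataset_path and read_parquet_files are hypothetical, and the repository id and file names used with them are placeholders rather than real datasets.

```python
def hf_dataset_path(repo: str, path_in_repo: str) -> str:
    """Build an hf://datasets/ path for a file or glob pattern inside a dataset repo."""
    # Strip a leading slash so the result never contains a doubled separator.
    return f"hf://datasets/{repo}/{path_in_repo.lstrip('/')}"


def read_parquet_files(repo: str, pattern: str):
    """Read the Parquet files in `repo` that match `pattern` into a DataFrame."""
    # Imported lazily so that the path helper above works without daft installed.
    import daft

    return daft.read_parquet(hf_dataset_path(repo, pattern))
```

For instance, read_parquet_files("username/dataset_name", "data/*.parquet") would read every Parquet file under data/ in that (placeholder) repository. The same path string can be passed to daft.read_csv() or daft.read_json() for files in those formats.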
Writing to a Dataset
Daft can write Parquet files to Hugging Face datasets using DataFrame.write_huggingface. Daft supports Content-Defined Chunking and Xet for faster, deduplicated writes.
Basic usage:
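A minimal sketch of a write, assuming a small in-memory DataFrame and a placeholder repository id. The function name write_example_dataset and the column names are hypothetical. Writing requires credentials with write access to the target repository.

```python
def write_example_dataset(repo: str):
    """Build a tiny DataFrame and write it as Parquet to the dataset `repo`."""
    # Imported lazily so that defining this helper needs neither daft nor network access.
    import daft

    # Illustrative data only; any DataFrame can be written the same way.
    df = daft.from_pydict({"id": [1, 2, 3], "text": ["a", "b", "c"]})

    # `repo` is a repository id such as "username/dataset_name" (a placeholder).
    return df.write_huggingface(repo)
```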
See the DataFrame.write_huggingface API page for more info.
Configuring Writes
DataFrame.write_huggingface accepts an IOConfig which can be used to configure the write behavior. Here's an example of how to use it:
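The following is a sketch under stated assumptions, not a definitive reference: it assumes that IOConfig and HuggingFaceConfig are importable from daft.io, that IOConfig takes the Hugging Face settings through an hf keyword, and that HuggingFaceConfig has a use_content_defined_chunking option. Check the HuggingFaceConfig API page for the actual argument names. The function name write_with_config is hypothetical.

```python
def write_with_config(df, repo: str):
    """Write `df` to the dataset `repo` using an explicit IOConfig."""
    # Assumed import path; imported lazily so defining this helper does not require daft.
    from daft.io import HuggingFaceConfig, IOConfig

    io_config = IOConfig(
        # Assumed keyword for passing Hugging Face settings to IOConfig.
        hf=HuggingFaceConfig(
            # Assumed option name for enabling Content-Defined Chunking.
            use_content_defined_chunking=True,
        )
    )
    return df.write_huggingface(repo, io_config=io_config)
```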
See the HuggingFaceConfig API page for more information about each argument.
Authentication
The token parameter in HuggingFaceConfig can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private datasets or writing to a dataset).
Example of reading a dataset with a specified token:
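One possible shape for such a read is sketched here. It assumes that IOConfig and HuggingFaceConfig are importable from daft.io and that IOConfig accepts the Hugging Face settings through an hf keyword. The function name read_private_parquet is hypothetical, and the path and token passed to it are supplied by the caller.

```python
def read_private_parquet(path: str, token: str):
    """Read Parquet data at an hf://datasets/ path using an access token."""
    # Imported lazily so that defining this helper needs neither daft nor network access.
    import daft
    from daft.io import HuggingFaceConfig, IOConfig  # assumed import path

    # `hf=` is the assumed keyword for passing Hugging Face settings to IOConfig.
    io_config = IOConfig(hf=HuggingFaceConfig(token=token))
    return daft.read_parquet(path, io_config=io_config)
```

Avoid hard-coding the token in source files; loading it from an environment variable or a secrets manager before calling the helper keeps it out of version control.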