Skip to content

Datasets#

Daft provides simple, performant, and responsible ways to access useful datasets like Common Crawl.

Common Crawl#

Check out our Common Crawl dataset guide for more examples!

common_crawl #

common_crawl(crawl: str, segment: str | None = None, content: Literal['raw', 'text', 'metadata', 'warc', 'wet', 'wat'] = 'raw', num_files: int | None = None, io_config: IOConfig | None = None, *, in_aws: bool) -> DataFrame

Load Common Crawl data as a DataFrame.

This function automatically resolves the specified crawl and segment into the appropriate Common Crawl files and loads them as a DataFrame, handling the WARC reading process internally.

Parameters:

Name Type Description Default
crawl str

The crawl identifier, e.g. "CC-MAIN-2025-33".

required
segment str | None

Specific segment to fetch within the crawl. If not provided, defaults to all segments in the crawl.

None
content Literal['raw', 'text', 'metadata', 'warc', 'wet', 'wat']

Specifies the type of content to load. Options are: + "raw" or "warc": Raw WARC files containing full HTTP responses + "text" or "wet": Extracted plain text content + "metadata" or "wat": Metadata about crawled pages

'raw'
num_files int | None

Limit the number of files to process. If not provided, processes all matching files.

None
io_config IOConfig | None

IO configuration for accessing S3.

None
in_aws bool

Where to fetch the common crawl data from. If running in AWS, this must be set to True. If outside of AWS, then this must be set to False. Setting this flag correctly is required for optimal download performance. If running in AWS, then make sure you're in the "us-east-1" region so you don't incur S3 egress fees! See the Common Crawl docs for more specific instructions.

required

Returns:

Type Description
DataFrame

A DataFrame containing the requested Common Crawl data.

Examples:

1
2
>>> # Create a dataframe from raw WARC data from a specific crawl
>>> daft.datasets.common_crawl("CC-MAIN-2025-33", in_aws=True)
╭────────────────┬─────────────────┬───────────┬────────────────────┬────────────┬────────────────────┬──────────────┬──────────────╮
│ WARC-Record-ID ┆ WARC-Target-URI ┆ WARC-Type ┆ WARC-Date          ┆      …     ┆ WARC-Identified-Pa ┆ warc_content ┆ warc_headers │
│ ---            ┆ ---             ┆ ---       ┆ ---                ┆            ┆ yload-Type         ┆ ---          ┆ ---          │
│ String         ┆ String          ┆ String    ┆ Timestamp[ns,      ┆ (1 hidden) ┆ ---                ┆ Binary       ┆ String       │
│                ┆                 ┆           ┆ "Etc/UTC"]         ┆            ┆ String             ┆              ┆              │
╰────────────────┴─────────────────┴───────────┴────────────────────┴────────────┴────────────────────┴──────────────┴──────────────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
1
2
>>> # Show a sample of extracted text content
>>> daft.datasets.common_crawl("CC-MAIN-2025-33", content="text", in_aws=True).limit(2).show()
╭─────────────────┬─────────────────┬────────────┬─────────────────┬────────────┬─────────────────┬────────────────┬────────────────╮
│ WARC-Record-ID  ┆ WARC-Target-URI ┆ WARC-Type  ┆ WARC-Date       ┆      …     ┆ WARC-Identified ┆ warc_content   ┆ warc_headers   │
│ ---             ┆ ---             ┆ ---        ┆ ---             ┆            ┆ -Payload-Type   ┆ ---            ┆ ---            │
│ String          ┆ String          ┆ String     ┆ Timestamp[ns    ┆ (1 hidden) ┆ ---             ┆ Binary         ┆ String         │
│                 ┆                 ┆            ┆ "Etc/UTC"]      ┆            ┆ String          ┆                ┆                │
╞═════════════════╪═════════════════╪════════════╪═════════════════╪════════════╪═════════════════╪════════════════╪════════════════╡
│ 0cb039e8-d357-4 ┆ None            ┆ warcinfo   ┆ 2025-08-16      ┆ …          ┆ None            ┆ b"Software-Inf ┆ {"Content-Type │
│ 85f-95dd-cdfdb… ┆                 ┆            ┆ 01:03:20 UTC    ┆            ┆                 ┆ o:             ┆ ":"application │
│                 ┆                 ┆            ┆                 ┆            ┆                 ┆ ia-web-commo…  ┆ /…             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ af55e6ef-eeda-4 ┆ http://010ganji ┆ conversion ┆ 2025-08-02      ┆ …          ┆ None            ┆ b"ETF\xe9\x80\ ┆ {"Content-Type │
│ bf7-a599-581bc… ┆ .com/html/ying… ┆            ┆ 23:06:24 UTC    ┆            ┆                 ┆ x89\xe6\x8b\xa ┆ ":"text/plain" │
│                 ┆                 ┆            ┆                 ┆            ┆                 ┆ 9…             ┆ ,…             │
╰─────────────────┴─────────────────┴────────────┴─────────────────┴────────────┴─────────────────┴────────────────┴────────────────╯
(Showing first 2 of 2 rows)
1
2
3
4
5
6
>>> # Sample a single file from a specific segment in a crawl for testing
>>> (
...     daft.datasets.common_crawl("CC-MAIN-2025-33", segment="1754151579063.98", num_files=1, in_aws=True)
...     .limit(2)
...     .show()
... )
╭─────────────────┬─────────────────┬───────────┬─────────────────┬────────────┬─────────────────┬─────────────────┬────────────────╮
│ WARC-Record-ID  ┆ WARC-Target-URI ┆ WARC-Type ┆ WARC-Date       ┆      …     ┆ WARC-Identified ┆ warc_content    ┆ warc_headers   │
│ ---             ┆ ---             ┆ ---       ┆ ---             ┆            ┆ -Payload-Type   ┆ ---             ┆ ---            │
│ String          ┆ String          ┆ String    ┆ Timestamp[ns    ┆ (1 hidden) ┆ ---             ┆ Binary          ┆ String         │
│                 ┆                 ┆           ┆ "Etc/UTC"]      ┆            ┆ String          ┆                 ┆                │
╞═════════════════╪═════════════════╪═══════════╪═════════════════╪════════════╪═════════════════╪═════════════════╪════════════════╡
│ b6238b9c-8db0-4 ┆ None            ┆ warcinfo  ┆ 2025-08-15      ┆ …          ┆ None            ┆ b"isPartOf: CC- ┆ {"Content-Type │
│ 5ac-a6ef-c3cb0… ┆                 ┆           ┆ 20:42:38 UTC    ┆            ┆                 ┆ MAIN-2025-33\r… ┆ ":"application │
│                 ┆                 ┆           ┆                 ┆            ┆                 ┆                 ┆ /…             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b29da11b-5976-4 ┆ http://0.woxav. ┆ request   ┆ 2025-08-15      ┆ …          ┆ None            ┆ b"GET /forum-12 ┆ {"Content-Type │
│ f3b-82c4-71fdd… ┆ com/forum-120-… ┆           ┆ 22:33:40 UTC    ┆            ┆                 ┆ 0-1.html HTTP/… ┆ ":"application │
│                 ┆                 ┆           ┆                 ┆            ┆                 ┆                 ┆ /…             │
╰─────────────────┴─────────────────┴───────────┴─────────────────┴────────────┴─────────────────┴─────────────────┴────────────────╯
(Showing first 2 of 2 rows)
Source code in daft/datasets/common_crawl.py
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
def common_crawl(
    crawl: str,
    segment: str | None = None,
    content: Literal["raw", "text", "metadata", "warc", "wet", "wat"] = "raw",
    num_files: int | None = None,
    io_config: IOConfig | None = None,
    *,
    in_aws: bool,
) -> DataFrame:
    r"""Load Common Crawl data as a DataFrame.

    This function automatically resolves the specified crawl and segment into the appropriate Common Crawl files
    and loads them as a DataFrame, handling the WARC reading process internally.

    Args:
        crawl: The crawl identifier, e.g. "CC-MAIN-2025-33".
        segment: Specific segment to fetch within the crawl. If not provided, defaults to all segments in the crawl.
        content: Specifies the type of content to load. Options are:
            + "raw" or "warc": Raw WARC files containing full HTTP responses
            + "text" or "wet": Extracted plain text content
            + "metadata" or "wat": Metadata about crawled pages
        num_files: Limit the number of files to process. If not provided, processes all matching files.
        io_config: IO configuration for accessing S3.
        in_aws: Where to fetch the common crawl data from. If running in AWS, this must be set to True. If outside of AWS,
                then this must be set to False. Setting this flag correctly is required for **optimal download performance**.
                If running in AWS, then make sure you're in the "us-east-1" region so you don't incur S3 egress fees!
                See [the Common Crawl docs](https://commoncrawl.org/get-started) for more specific instructions.

    Returns:
        A DataFrame containing the requested Common Crawl data.

    Examples:
        >>> # Create a dataframe from raw WARC data from a specific crawl
        >>> daft.datasets.common_crawl("CC-MAIN-2025-33", in_aws=True)  # doctest: +SKIP
        ╭────────────────┬─────────────────┬───────────┬────────────────────┬────────────┬────────────────────┬──────────────┬──────────────╮
        │ WARC-Record-ID ┆ WARC-Target-URI ┆ WARC-Type ┆ WARC-Date          ┆      …     ┆ WARC-Identified-Pa ┆ warc_content ┆ warc_headers │
        │ ---            ┆ ---             ┆ ---       ┆ ---                ┆            ┆ yload-Type         ┆ ---          ┆ ---          │
        │ String         ┆ String          ┆ String    ┆ Timestamp[ns,      ┆ (1 hidden) ┆ ---                ┆ Binary       ┆ String       │
        │                ┆                 ┆           ┆ "Etc/UTC"]         ┆            ┆ String             ┆              ┆              │
        ╰────────────────┴─────────────────┴───────────┴────────────────────┴────────────┴────────────────────┴──────────────┴──────────────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)

        >>> # Show a sample of extracted text content
        >>> daft.datasets.common_crawl("CC-MAIN-2025-33", content="text", in_aws=True).limit(2).show()  # doctest: +SKIP
        ╭─────────────────┬─────────────────┬────────────┬─────────────────┬────────────┬─────────────────┬────────────────┬────────────────╮
        │ WARC-Record-ID  ┆ WARC-Target-URI ┆ WARC-Type  ┆ WARC-Date       ┆      …     ┆ WARC-Identified ┆ warc_content   ┆ warc_headers   │
        │ ---             ┆ ---             ┆ ---        ┆ ---             ┆            ┆ -Payload-Type   ┆ ---            ┆ ---            │
        │ String          ┆ String          ┆ String     ┆ Timestamp[ns    ┆ (1 hidden) ┆ ---             ┆ Binary         ┆ String         │
        │                 ┆                 ┆            ┆ "Etc/UTC"]      ┆            ┆ String          ┆                ┆                │
        ╞═════════════════╪═════════════════╪════════════╪═════════════════╪════════════╪═════════════════╪════════════════╪════════════════╡
        │ 0cb039e8-d357-4 ┆ None            ┆ warcinfo   ┆ 2025-08-16      ┆ …          ┆ None            ┆ b"Software-Inf ┆ {"Content-Type │
        │ 85f-95dd-cdfdb… ┆                 ┆            ┆ 01:03:20 UTC    ┆            ┆                 ┆ o:             ┆ ":"application │
        │                 ┆                 ┆            ┆                 ┆            ┆                 ┆ ia-web-commo…  ┆ /…             │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ af55e6ef-eeda-4 ┆ http://010ganji ┆ conversion ┆ 2025-08-02      ┆ …          ┆ None            ┆ b"ETF\xe9\x80\ ┆ {"Content-Type │
        │ bf7-a599-581bc… ┆ .com/html/ying… ┆            ┆ 23:06:24 UTC    ┆            ┆                 ┆ x89\xe6\x8b\xa ┆ ":"text/plain" │
        │                 ┆                 ┆            ┆                 ┆            ┆                 ┆ 9…             ┆ ,…             │
        ╰─────────────────┴─────────────────┴────────────┴─────────────────┴────────────┴─────────────────┴────────────────┴────────────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

        >>> # Sample a single file from a specific segment in a crawl for testing
        >>> (
        ...     daft.datasets.common_crawl("CC-MAIN-2025-33", segment="1754151579063.98", num_files=1, in_aws=True)
        ...     .limit(2)
        ...     .show()
        ... )  # doctest: +SKIP
        ╭─────────────────┬─────────────────┬───────────┬─────────────────┬────────────┬─────────────────┬─────────────────┬────────────────╮
        │ WARC-Record-ID  ┆ WARC-Target-URI ┆ WARC-Type ┆ WARC-Date       ┆      …     ┆ WARC-Identified ┆ warc_content    ┆ warc_headers   │
        │ ---             ┆ ---             ┆ ---       ┆ ---             ┆            ┆ -Payload-Type   ┆ ---             ┆ ---            │
        │ String          ┆ String          ┆ String    ┆ Timestamp[ns    ┆ (1 hidden) ┆ ---             ┆ Binary          ┆ String         │
        │                 ┆                 ┆           ┆ "Etc/UTC"]      ┆            ┆ String          ┆                 ┆                │
        ╞═════════════════╪═════════════════╪═══════════╪═════════════════╪════════════╪═════════════════╪═════════════════╪════════════════╡
        │ b6238b9c-8db0-4 ┆ None            ┆ warcinfo  ┆ 2025-08-15      ┆ …          ┆ None            ┆ b"isPartOf: CC- ┆ {"Content-Type │
        │ 5ac-a6ef-c3cb0… ┆                 ┆           ┆ 20:42:38 UTC    ┆            ┆                 ┆ MAIN-2025-33\r… ┆ ":"application │
        │                 ┆                 ┆           ┆                 ┆            ┆                 ┆                 ┆ /…             │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ b29da11b-5976-4 ┆ http://0.woxav. ┆ request   ┆ 2025-08-15      ┆ …          ┆ None            ┆ b"GET /forum-12 ┆ {"Content-Type │
        │ f3b-82c4-71fdd… ┆ com/forum-120-… ┆           ┆ 22:33:40 UTC    ┆            ┆                 ┆ 0-1.html HTTP/… ┆ ":"application │
        │                 ┆                 ┆           ┆                 ┆            ┆                 ┆                 ┆ /…             │
        ╰─────────────────┴─────────────────┴───────────┴─────────────────┴────────────┴─────────────────┴─────────────────┴────────────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    if num_files is not None and num_files <= 0:
        raise ValueError("num_files must be a positive integer")

    content_type_map: dict[str, Literal["warc", "wet", "wat"]] = {
        "raw": "warc",
        "text": "wet",
        "metadata": "wat",
        "warc": "warc",
        "wet": "wet",
        "wat": "wat",
    }
    if content not in content_type_map:
        raise ValueError(f"Invalid content type for daft.datasets.common_crawl: {content}")
    file_type = content_type_map[content]

    warc_paths = _get_common_crawl_paths(
        crawl=crawl,
        segment=segment,
        file_type=file_type,
        num_files=num_files,
        io_config=io_config,
        in_aws=in_aws,
    )

    return read_warc(warc_paths, io_config=io_config)