Schema#
Daft can display a DataFrame's schema without materializing the DataFrame. Under the hood, it samples your data to infer the schema, and when you modify the DataFrame it infers the resulting column types from the operations applied.
Schema #
Schema()
Methods:
| Name | Description |
|---|---|
| `apply_hints` | Applies hints from another schema to this schema. |
| `column_names` | Returns a list of the names of the columns in the schema. |
| `display_with_metadata` | Returns a string representation of the schema, optionally including metadata. |
| `estimate_row_size_bytes` | Estimates the size of a row in bytes based on the schema. |
| `from_csv` | Creates a Schema from a CSV file. |
| `from_field_name_and_types` | Creates a Daft Schema from a list of field names and types. |
| `from_json` | Creates a Schema from a JSON file. |
| `from_parquet` | Creates a Schema from a Parquet file. |
| `from_pyarrow_schema` | Creates a Daft Schema from a PyArrow Schema. |
| `from_pydict` | Creates a Schema from a dictionary of field names and their corresponding DataTypes. |
| `min_estimated_size_column` | Returns the name of the column with the minimum estimated size. |
| `to_name_set` | Returns a set of column names in the schema. |
| `to_pyarrow_schema` | Converts a Daft Schema to a PyArrow Schema. |
| `union` | Creates a new Schema that is the union of this schema and another schema. |
Source code in daft/schema.py
apply_hints #
Applies hints from another schema to this schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hints` | `Schema` | Schema containing hints to apply to this schema. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | A new Schema with the hints applied. |
column_names #
column_names() -> list[str]
Returns a list of the names of the columns in the schema.
Returns:
| Type | Description |
|---|---|
| `list[str]` | List of column names in the schema. |
display_with_metadata #
display_with_metadata(include_metadata: bool = False) -> str
Returns a string representation of the schema, optionally including metadata.
estimate_row_size_bytes #
estimate_row_size_bytes() -> float
Estimates the size of a row in bytes based on the schema.
Returns:
| Name | Type | Description |
|---|---|---|
| `float` | `float` | Estimated size of a row in bytes. |
from_csv #
from_csv(path: str, parse_options: CsvParseOptions | None = None, io_config: IOConfig | None = None, multithreaded_io: bool | None = None) -> Schema
Creates a Schema from a CSV file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the CSV file. | required |
| `parse_options` | `CsvParseOptions \| None` | Options for parsing the CSV file. | `None` |
| `io_config` | `IOConfig \| None` | IO configuration for reading the file. | `None` |
| `multithreaded_io` | `bool \| None` | Whether to use multithreaded IO. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | A Schema object representing the CSV file. |
from_field_name_and_types #
Creates a Daft Schema from a list of field names and types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `list[tuple[str, DataType]]` | List of field names and types. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | Daft schema with the provided field names and types. |
from_json #
from_json(path: str, parse_options: JsonParseOptions | None = None, io_config: IOConfig | None = None, multithreaded_io: bool | None = None) -> Schema
Creates a Schema from a JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the JSON file. | required |
| `parse_options` | `JsonParseOptions \| None` | Options for parsing the JSON file. | `None` |
| `io_config` | `IOConfig \| None` | IO configuration for reading the file. | `None` |
| `multithreaded_io` | `bool \| None` | Whether to use multithreaded IO. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | A Schema object representing the JSON file. |
from_parquet #
from_parquet(path: str, io_config: IOConfig | None = None, multithreaded_io: bool | None = None, coerce_int96_timestamp_unit: TimeUnit = ns()) -> Schema
Creates a Schema from a Parquet file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the Parquet file. | required |
| `io_config` | `IOConfig \| None` | IO configuration for reading the file. | `None` |
| `multithreaded_io` | `bool \| None` | Whether to use multithreaded IO. | `None` |
| `coerce_int96_timestamp_unit` | `TimeUnit` | The time unit to coerce INT96 timestamps to. | `ns()` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | A Schema object representing the Parquet file. |
from_pyarrow_schema #
from_pyarrow_schema(pa_schema: pa.Schema) -> Schema
Creates a Daft Schema from a PyArrow Schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pa_schema` | `pa.Schema` | PyArrow schema to convert. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | Converted Daft schema. |
from_pydict #
Creates a Schema from a dictionary of field names and their corresponding DataTypes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `dict[str, DataType]` | Dictionary mapping field names to DataTypes. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | A Schema object created from the provided fields. |
min_estimated_size_column #
min_estimated_size_column() -> str | None
Returns the name of the column with the minimum estimated size.
to_name_set #
to_name_set() -> set[str]
Returns a set of column names in the schema.
Returns:
| Type | Description |
|---|---|
| `set[str]` | Set of column names in the schema. |
to_pyarrow_schema #
to_pyarrow_schema() -> pa.Schema
Converts a Daft Schema to a PyArrow Schema.
Returns:
| Type | Description |
|---|---|
| `pa.Schema` | PyArrow schema that corresponds to the provided Daft schema. |
union #
Creates a new Schema that is the union of this schema and another schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `other` | `Schema` | The schema to union with this schema. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `Schema` | `Schema` | A new Schema that is the union of this schema and the other schema. |
Schema has been moved to daft.schema but is still accessible at daft.logical.schema.