Creating File References with from_files
Daft provides [daft.from_files()][daft.io.from_files] to create a DataFrame of lazy file references from glob patterns. Unlike other read functions that immediately load file contents, from_files creates File objects that can be read on demand.
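Basic Usage
Pass a glob pattern to daft.from_files() to get a DataFrame of file references, for example daft.from_files("/path/to/files/*"). Matching files are referenced, not read, until their contents are requested.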
Output Schema
The from_files function returns a DataFrame with a single column:
| Column | Type | Description |
| --- | --- | --- |
| file | File | Lazy file references that can be read on demand |
Wildcard Patterns
from_files supports standard glob patterns:
| Pattern | Description |
| --- | --- |
| `*` | Matches any number of characters |
| `?` | Matches any single character |
| `[...]` | Matches any single character in the brackets |
| `**` | Recursively matches directories |
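To build intuition for these wildcards, the sketch below uses Python's standard-library `glob` module rather than Daft itself. It assumes that Daft's matcher follows the same conventions for these four patterns, which may not hold in every edge case (for instance, Python only treats `**` as recursive when `recursive=True` is passed). The file names are made up for the demonstration.

```python
import glob
import os
import tempfile

# Build a small throwaway directory tree to match against.
root = tempfile.mkdtemp()
for rel in ["a1.txt", "a2.txt", "b1.txt", "ab1.txt", "sub/c1.txt", "sub/deep/d1.txt"]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("x")


def match(pattern):
    """Return paths matching `pattern`, relative to root, sorted for stable output."""
    # glob.escape protects any special characters in the temp directory name.
    hits = glob.glob(os.path.join(glob.escape(root), pattern), recursive=True)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


star = match("*.txt")          # any run of characters, top level only
question = match("a?.txt")     # exactly one character after "a"
bracket = match("[ab]1.txt")   # one character from the set {a, b}
recursive = match("**/*.txt")  # top level plus every subdirectory

print(star)       # ['a1.txt', 'a2.txt', 'ab1.txt', 'b1.txt']
print(question)   # ['a1.txt', 'a2.txt']
print(bracket)    # ['a1.txt', 'b1.txt']
print(recursive)  # the four files above, plus 'sub/c1.txt' and 'sub/deep/d1.txt'
```

Note that `a?.txt` does not match `ab1.txt`, because `?` stands for exactly one character.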
Working with File Objects
The File type is a lazy reference that provides access to file metadata and content:
```python
import daft
from daft import col

df = daft.from_files("/path/to/files/*")

# Access file properties
df = df.select(
    col("file").file_path().alias("path"),
    col("file").file_size().alias("size_bytes"),
)
df.show()
```
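As a conceptual aid, the following plain-Python sketch shows what "lazy reference" means in general: creating the reference stores only a path, metadata comes from the filesystem, and the contents are read only on an explicit call. `LazyFileRef` is a hypothetical class invented for this illustration. It is not Daft's File type and says nothing about how Daft implements it.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFileRef:
    """Hypothetical stand-in for a lazy file reference (not Daft's File type)."""

    path: str

    def size(self) -> int:
        # Metadata lookup only: asks the filesystem, does not read the contents.
        return os.stat(self.path).st_size

    def read(self) -> bytes:
        # Contents are loaded only when this method is called.
        with open(self.path, "rb") as f:
            return f.read()


# Demo on a throwaway file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"hello")

ref = LazyFileRef(demo_path)  # cheap: stores the path, opens nothing
size_bytes = ref.size()       # metadata access
content = ref.read()          # explicit, on-demand read
```

Separating metadata access from content access is what makes it possible to filter on properties such as size before paying the cost of reading any file.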
Use Cases
Image Processing Pipeline
For image processing, use daft.from_glob_path() with .download() and decode_image():
```python
import daft
from daft import col
from daft.functions import decode_image

df = (
    daft.from_glob_path("s3://bucket/images/**/*.jpg")
    .with_column("image_bytes", col("path").download())
    .with_column("image", decode_image(col("image_bytes")))
)
df.show()
```
Batch File Operations
```python
import daft
from daft import col

# Get file references with metadata
df = daft.from_files("/data/**/*")

# Filter by file properties before reading content
large_files = df.where(col("file").file_size() > 1_000_000)
large_files.show()
```
Comparison with from_glob_path
from_files is similar to daft.from_glob_path() but returns File objects instead of path strings:
| Function | Returns | Use Case |
| --- | --- | --- |
| from_glob_path | path column (string) | When you need file paths only |
| from_files | file column (File) | When you need to read file content or access file properties |
```python
import daft

# from_glob_path returns paths
paths_df = daft.from_glob_path("/images/*.jpg")  # Column: path (string)

# from_files returns File objects
files_df = daft.from_files("/images/*.jpg")  # Column: file (File)
```
Empty Results
If no files match the glob pattern(s), an empty DataFrame is returned instead of raising an error:
```python
import daft

df = daft.from_files("/nonexistent/*.txt")
df.show()  # Empty DataFrame with "file" column
```
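Because an unmatched pattern yields an empty result rather than an error, a typo in a pattern can pass silently. A pipeline that requires at least one input may want an explicit guard. The sketch below illustrates the idea with Python's standard-library `glob` rather than Daft. `require_matches` is a hypothetical helper, not part of any library. With a Daft DataFrame, a row-count check would presumably play the same role.

```python
import glob
import os
import tempfile


def require_matches(pattern):
    """Return paths matching `pattern`, raising if nothing matched."""
    paths = glob.glob(pattern, recursive=True)
    if not paths:
        raise FileNotFoundError(f"no files matched pattern: {pattern!r}")
    return paths


# An empty temporary directory guarantees the pattern matches nothing.
empty_dir = tempfile.mkdtemp()
pattern = os.path.join(glob.escape(empty_dir), "*.txt")

try:
    require_matches(pattern)
    outcome = "matched"
except FileNotFoundError:
    outcome = "no matches"
```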