Creating File References with from_files

Daft provides daft.from_files() to create a DataFrame of lazy file references from glob patterns. Unlike read functions that load file contents, from_files creates File objects whose contents are read only on demand.

Basic Usage

# Local files
import daft

df = daft.from_files("/path/to/files/*.jpeg")
df.show()

# Amazon S3
import daft

df = daft.from_files("s3://my-bucket/images/*.png")
df.show()

# Google Cloud Storage
import daft

df = daft.from_files("gs://my-bucket/images/*.png")
df.show()

Output Schema

The from_files function returns a DataFrame with a single column:

| Column | Type | Description |
| --- | --- | --- |
| file | File | Lazy file references that can be read on demand |

Wildcard Patterns

from_files supports standard glob patterns:

| Pattern | Description |
| --- | --- |
| * | Matches any number of characters |
| ? | Matches any single character |
| [...] | Matches any single character in the brackets |
| ** | Recursively matches directories |

import daft

# All JPEG files in a directory
df = daft.from_files("/images/*.jpeg")

# Recursive search
df = daft.from_files("/images/**/*.png")

# Multiple patterns
df = daft.from_files(["/images/*.jpeg", "/photos/*.jpeg"])
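
To get a feel for what each wildcard selects, the sketch below runs the same four patterns through Python's standard-library glob module against a temporary directory tree. This is only an analogy: Daft uses its own matcher, so edge cases (for example, whether ** also matches zero directories) are an assumption here and may differ. The helper names make_tree and match are hypothetical and exist only for this illustration.

```python
import glob
import os
import tempfile


def make_tree(root):
    """Create a small tree of empty files to match against."""
    for rel in ["a.png", "b.png", "c1.jpeg", "c2.jpeg",
                "nested/d.png", "nested/deep/e.png"]:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


def match(root, pattern):
    """Return sorted matches relative to root, using forward slashes."""
    hits = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    star = match(root, "*.png")          # top-level PNG files only
    question = match(root, "c?.jpeg")    # exactly one character after "c"
    bracket = match(root, "[ab].png")    # "a" or "b" followed by ".png"
    recursive = match(root, "**/*.png")  # PNG files at any depth (stdlib semantics)
    missing = match(root, "*.txt")       # nothing matches: empty list, no error

print(star, question, bracket, recursive, missing, sep="\n")
```

Trying patterns against a small local tree like this is a cheap way to sanity-check a glob before pointing from_files at a large bucket.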

Working with File Objects

The File type is a lazy reference that provides access to file metadata and content:

import daft
from daft import col

df = daft.from_files("/path/to/files/*")

# Access file properties
df = df.select(
    col("file").file_path().alias("path"),
    col("file").file_size().alias("size_bytes"),
)
df.show()
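
The idea of a lazy reference can be illustrated without Daft at all. The following is a conceptual sketch, not Daft's implementation: LazyFileRef is a hypothetical class that stores only a path, answers metadata questions through the filesystem, and touches the file contents only when read() is called.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFileRef:
    """Hypothetical stand-in for a lazy file reference: holds only the path."""
    path: str

    def size(self) -> int:
        # Metadata lookup: queries the filesystem without reading the contents.
        return os.stat(self.path).st_size

    def read(self) -> bytes:
        # The contents are loaded only when this method is called.
        with open(self.path, "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "example.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 2048)

    ref = LazyFileRef(path)   # creating the reference reads nothing
    size_bytes = ref.size()   # metadata only
    payload = ref.read()      # contents read on demand

print(size_bytes, len(payload))
```

Separating metadata access from content access is what makes it cheap to inspect or filter a large set of files before deciding which ones to read.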

Use Cases

Image Processing Pipeline

For image processing, use daft.from_glob_path() with .download() and decode_image():

import daft
from daft import col
from daft.functions import decode_image

df = (
    daft.from_glob_path("s3://bucket/images/**/*.jpg")
    .with_column("image_bytes", col("path").download())
    .with_column("image", decode_image(col("image_bytes")))
)
df.show()

Batch File Operations

import daft
from daft import col

# Get file references with metadata
df = daft.from_files("/data/**/*")

# Filter by file properties before reading content
large_files = df.where(col("file").file_size() > 1_000_000)
large_files.show()

Comparison with from_glob_path

from_files is similar to daft.from_glob_path() but returns File objects instead of path strings:

| Function | Returns | Use Case |
| --- | --- | --- |
| from_glob_path | path column (string) | When you need file paths only |
| from_files | file column (File) | When you need to read file content or access file properties |

import daft

# from_glob_path returns paths
paths_df = daft.from_glob_path("/images/*.jpg")  # Column: path (string)

# from_files returns File objects
files_df = daft.from_files("/images/*.jpg")  # Column: file (File)

Empty Results

If no files match the glob pattern(s), an empty DataFrame is returned instead of raising an error:

import daft

df = daft.from_files("/nonexistent/*.txt")
df.show()  # Empty DataFrame with "file" column
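
Because an empty match is not an error, a pipeline that expects input may want to check for it explicitly. Below is a minimal sketch of one way to do that. require_matches is a hypothetical helper, not part of Daft; it relies only on the DataFrame's count_rows() method, which executes the plan in order to count rows.

```python
def require_matches(df, pattern):
    """Return df unchanged, or raise if it contains no rows.

    Hypothetical helper: df is any object exposing count_rows(),
    such as a Daft DataFrame.
    """
    if df.count_rows() == 0:
        raise FileNotFoundError(f"no files matched pattern: {pattern!r}")
    return df


# Intended use with Daft (shown as comments, not executed here):
#   pattern = "/nonexistent/*.txt"
#   df = require_matches(daft.from_files(pattern), pattern)
```

Failing early like this turns a silently empty result into a visible error at the start of the job; whether that is preferable to continuing with an empty DataFrame depends on the pipeline.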