Skip to content

Writing to Google Cloud Bigtable#

Experimental

This connector is experimental and the API may change in future releases.

Google Cloud Bigtable is a fully managed, scalable NoSQL database service. Daft can write DataFrames to Bigtable tables using df.write_bigtable().

Installing Dependencies#

Bigtable support requires the google-cloud-bigtable package:

1
pip install google-cloud-bigtable

Basic Usage#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
import daft

# Create a DataFrame
df = daft.from_pydict({
    "user_id": ["user_001", "user_002", "user_003"],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "email": ["alice@example.com", "bob@example.com", "charlie@example.com"],
})

# Write to Bigtable
result = df.write_bigtable(
    project_id="my-gcp-project",
    instance_id="my-bigtable-instance",
    table_id="users",
    row_key_column="user_id",
    column_family_mappings={
        "name": "profile",
        "age": "profile",
        "email": "contact",
    },
)

Key Concepts#

Row Keys#

Every Bigtable row requires a unique row key. Use row_key_column to specify which DataFrame column should be used as the row key:

1
2
3
4
df.write_bigtable(
    ...,
    row_key_column="user_id",  # This column becomes the row key
)

Column Families#

Bigtable organizes columns into column families. Use column_family_mappings to specify which family each column belongs to:

1
2
3
4
5
6
7
8
df.write_bigtable(
    ...,
    column_family_mappings={
        "name": "user_data",      # 'name' column goes to 'user_data' family
        "age": "user_data",       # 'age' column goes to 'user_data' family
        "email": "contact_info",  # 'email' column goes to 'contact_info' family
    },
)

The column families must already exist in the Bigtable table.

Parameters#

Parameter Type Required Description
project_id str Yes Google Cloud project ID
instance_id str Yes Bigtable instance ID
table_id str Yes Bigtable table ID
row_key_column str Yes Column name to use as the row key
column_family_mappings dict[str, str] Yes Mapping of column names to column families
client_kwargs dict No Additional arguments for the Bigtable Client
write_kwargs dict No Additional arguments for MutationsBatcher
serialize_incompatible_types bool No Auto-convert incompatible types to JSON (default: True)

Data Type Handling#

Bigtable cells only accept data that can be converted to bytes. By default, Daft automatically serializes incompatible types to JSON:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
df = daft.from_pydict({
    "id": ["row1"],
    "data": [{"nested": "object"}],  # Complex type
})

# Complex types are automatically serialized to JSON
df.write_bigtable(
    ...,
    serialize_incompatible_types=True,  # Default behavior
)

To disable automatic serialization (will raise an error for incompatible types):

1
2
3
4
df.write_bigtable(
    ...,
    serialize_incompatible_types=False,
)

Advanced Configuration#

Client Options#

Pass additional options to the Bigtable client:

1
2
3
4
5
6
7
result = df.write_bigtable(
    ...,
    client_kwargs={
        "admin": True,
        "channel": custom_channel,
    },
)

Write Options#

Configure the MutationsBatcher for write operations:

1
2
3
4
5
6
7
result = df.write_bigtable(
    ...,
    write_kwargs={
        "flush_count": 1000,
        "max_row_bytes": 5 * 1024 * 1024,  # 5MB
    },
)

Use Cases#

IoT Data Storage#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import daft
from daft import col

# Read sensor data
df = daft.read_parquet("s3://bucket/sensors/*.parquet")

# Prepare for Bigtable (create composite row key)
df = df.with_column(
    "row_key",
    col("device_id").concat("#").concat(col("timestamp").cast(str)),
)

# Write to Bigtable
df.write_bigtable(
    project_id="iot-project",
    instance_id="sensor-data",
    table_id="readings",
    row_key_column="row_key",
    column_family_mappings={
        "temperature": "metrics",
        "humidity": "metrics",
        "device_id": "metadata",
        "timestamp": "metadata",
    },
)

User Profile Storage#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import daft

df = daft.from_pydict({
    "user_id": ["u001", "u002"],
    "preferences": [{"theme": "dark"}, {"theme": "light"}],
    "last_login": ["2024-01-15", "2024-01-16"],
})

df.write_bigtable(
    project_id="my-project",
    instance_id="user-store",
    table_id="profiles",
    row_key_column="user_id",
    column_family_mappings={
        "preferences": "settings",
        "last_login": "activity",
    },
)

Notes#

  • The Bigtable table and column families must exist before writing
  • Row keys should be designed carefully for efficient access patterns
  • Consider Bigtable's row key design best practices for optimal performance