Reading from and Writing to Google Cloud Storage
Daft can read data from and write data to Google Cloud Storage (GCS), and natively understands the URL protocols gs:// and gcs:// as referring to data that resides in GCS.
Authorization/Authentication
In GCS, data is organized in the following hierarchy:
- Project: The Google Cloud project that owns the storage resources.
- Bucket: The container for data storage.
- Object Key: The unique identifier for a piece of data within a bucket.
URLs to data in GCS come in the form: gs://{BUCKET}/{OBJECT_KEY}.
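As a small illustration of the URL form above, the sketch below splits a GCS URL into its bucket and object key using plain Python string handling. The helper name `split_gcs_url` and the example bucket are hypothetical and not part of Daft, which parses these URLs internally.

```python
def split_gcs_url(url: str) -> tuple[str, str]:
    """Split gs://{BUCKET}/{OBJECT_KEY} (or gcs://...) into (bucket, object_key)."""
    scheme, sep, rest = url.partition("://")
    if not sep or scheme not in ("gs", "gcs"):
        raise ValueError(f"not a GCS URL: {url!r}")
    # Everything up to the first "/" is the bucket; the remainder is the key.
    bucket, _, object_key = rest.partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in URL: {url!r}")
    return bucket, object_key


# Hypothetical bucket and key, for illustration only.
bucket, object_key = split_gcs_url("gs://my-bucket/data/2024/part-0.parquet")
print(bucket)      # my-bucket
print(object_key)  # data/2024/part-0.parquet
```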
Rely on Environment
You can configure Application Default Credentials (ADC) to have Daft automatically discover credentials. Common methods include:
- Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a service account key file
- Running gcloud auth application-default login for local development
- Using the default service account when running on Google Cloud (GCE, GKE, Cloud Run, etc.)
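The sketch below is a hypothetical local diagnostic, not part of Daft: it reports which of the credential sources listed above appears to be configured on the current machine, which can help when checking that a machine is provisioned. The well-known file path is the location gcloud uses on Linux and macOS; the function name and messages are illustrative assumptions.

```python
import os
from pathlib import Path


def describe_adc_source() -> str:
    """Best-effort guess at which ADC source is configured on this machine."""
    key_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_file:
        return f"service account key file: {key_file}"

    # Written by `gcloud auth application-default login` on Linux and macOS.
    user_file = Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
    if user_file.is_file():
        return f"gcloud user credentials: {user_file}"

    # Nothing found locally; on Google Cloud the attached service account may still apply.
    return "no local credentials found (default service account applies only on Google Cloud)"


print(describe_adc_source())
```

With credentials in place, a call such as `daft.read_parquet("gs://my-bucket/path/*.parquet")` (bucket name hypothetical) needs no explicit configuration.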
Note that in a distributed environment such as Ray, Daft picks these credentials up on the worker machines, so each worker machine must be provisioned with them.
If you would rather have Daft use credentials from the "driver", specify your credentials manually.
Manually specify credentials
You can also pass credentials into your Daft I/O function calls explicitly, using a daft.io.GCSConfig config object.
daft.set_planning_config is a convenient way to set your daft.io.IOConfig as the default config to use on any subsequent Daft method calls.
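A minimal sketch of setting a default config, assuming a service account key file is available locally. The helper name, project ID, and paths are hypothetical, and the `default_io_config` keyword is an assumption about `daft.set_planning_config` that may differ between Daft versions. Daft is imported inside the function so that the file check runs first.

```python
import os


def use_gcs_credentials_by_default(project_id: str, key_file: str) -> None:
    """Register explicit GCS credentials as the default for later Daft calls."""
    # Fail early if the key file is missing on this machine.
    if not os.path.isfile(key_file):
        raise FileNotFoundError(f"service account key file not found: {key_file}")

    import daft
    from daft.io import GCSConfig, IOConfig

    io_config = IOConfig(gcs=GCSConfig(project_id=project_id, credentials=key_file))
    # Assumed keyword name: default_io_config.
    daft.set_planning_config(default_io_config=io_config)


# Example usage (hypothetical project, key file, and bucket):
# import daft
# use_gcs_credentials_by_default("my-project", "/path/to/key.json")
# df = daft.read_parquet("gs://my-bucket/data/*.parquet")
```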
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the io_config= keyword argument. This is extremely flexible as you can pass a different daft.io.GCSConfig per function call if you wish!
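As an illustration of per-operation overrides, the sketch below reads with one set of credentials and writes with another. The helper name, bucket URLs, and key file paths are hypothetical; only the io_config= keyword and the daft.io config classes come from the text above.

```python
def copy_parquet(src_url: str, dst_url: str, src_key_file: str, dst_key_file: str) -> None:
    """Read Parquet from one GCS location and write it to another, each with its own credentials."""
    for url in (src_url, dst_url):
        if not url.startswith(("gs://", "gcs://")):
            raise ValueError(f"expected a gs:// or gcs:// URL, got {url!r}")

    import daft
    from daft.io import GCSConfig, IOConfig

    # One config per operation, each carrying its own key file.
    read_config = IOConfig(gcs=GCSConfig(credentials=src_key_file))
    write_config = IOConfig(gcs=GCSConfig(credentials=dst_key_file))

    df = daft.read_parquet(src_url, io_config=read_config)
    df.write_parquet(dst_url, io_config=write_config)


# Example usage (hypothetical buckets and key files):
# copy_parquet(
#     "gs://source-bucket/data/*.parquet",
#     "gs://target-bucket/copy/",
#     "/path/to/source-key.json",
#     "/path/to/target-key.json",
# )
```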
Configuration Options
The daft.io.GCSConfig object supports the following options:
| Parameter | Type | Description |
|---|---|---|
| project_id | str | Google Cloud project ID |
| credentials | str | Path to service account JSON key file |
| token | str | OAuth2 access token |
| anonymous | bool | Whether to use anonymous access (for public buckets) |
| max_connections | int | Maximum number of concurrent connections |
| retry_initial_backoff_ms | int | Initial backoff time in milliseconds for retries |
| connect_timeout_ms | int | Connection timeout in milliseconds |
| read_timeout_ms | int | Read timeout in milliseconds |
| num_tries | int | Number of retry attempts |
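
A hedged sketch combining several of the options above for anonymous reads from a public bucket. The specific values are illustrative assumptions rather than recommended settings, and the names `PUBLIC_BUCKET_OPTIONS` and `public_bucket_io_config` are hypothetical.

```python
# Illustrative values only; tune them for your workload.
PUBLIC_BUCKET_OPTIONS = {
    "anonymous": True,                # no credentials; public buckets only
    "max_connections": 32,            # cap on concurrent connections
    "connect_timeout_ms": 10_000,     # 10 s to establish a connection
    "read_timeout_ms": 30_000,        # 30 s per read
    "num_tries": 5,                   # retry attempts
    "retry_initial_backoff_ms": 500,  # initial backoff before retrying
}


def public_bucket_io_config():
    """Build an IOConfig for anonymous GCS access from the options above."""
    from daft.io import GCSConfig, IOConfig

    return IOConfig(gcs=GCSConfig(**PUBLIC_BUCKET_OPTIONS))


# Example usage (hypothetical bucket):
# import daft
# df = daft.read_parquet("gs://some-public-bucket/*.parquet", io_config=public_bucket_io_config())
```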