Reading from and Writing to Tencent Cloud COS
Daft supports reading data from and writing data to Tencent Cloud COS (Cloud Object Storage). It natively understands the URL schemes cos:// and cosn:// (Hadoop CosN compatible) as referring to data that resides in COS.
Authorization/Authentication
In Tencent Cloud COS, data is stored under the hierarchy of:
- Bucket: The container for data storage, which is the top-level namespace for data storage in COS.
- Object Key: The unique identifier for a piece of data within a bucket.
URLs to data in COS come in the form: cos://{BUCKET}/{OBJECT_KEY}.
Hadoop CosN Compatibility
Daft also supports the cosn:// URL scheme for compatibility with Hadoop CosN. Both cos:// and cosn:// are treated identically.
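To make the URL layout concrete, here is a minimal sketch that splits a COS URL into its bucket and object key. The helper name split_cos_url is hypothetical and is not part of Daft; it only illustrates the cos://{BUCKET}/{OBJECT_KEY} form and the equivalence of the two schemes.

```python
def split_cos_url(url: str) -> tuple[str, str]:
    """Split a cos:// or cosn:// URL into (bucket, object_key)."""
    scheme, sep, rest = url.partition("://")
    # Both schemes are accepted because they refer to the same storage.
    if not sep or scheme not in ("cos", "cosn"):
        raise ValueError(f"not a COS URL: {url!r}")
    # Everything up to the first "/" is the bucket; the remainder is the key.
    bucket, _, object_key = rest.partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in COS URL: {url!r}")
    return bucket, object_key


print(split_cos_url("cos://my-bucket/data/part-0.parquet"))
print(split_cos_url("cosn://my-bucket/data/part-0.parquet"))
```

Both calls yield the same bucket and key, since only the scheme differs.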
Rely on Environment
You can configure Daft to automatically discover credentials from environment variables. Daft recognizes the following environment variables:
| Environment Variable | Description |
|---|---|
| COS_ENDPOINT | Endpoint of the COS service |
| COS_REGION or TENCENTCLOUD_REGION | Region of the COS service |
| COS_SECRET_ID or TENCENTCLOUD_SECRET_ID | SecretId for COS authentication |
| COS_SECRET_KEY or TENCENTCLOUD_SECRET_KEY | SecretKey for COS authentication |
| COS_SECURITY_TOKEN or TENCENTCLOUD_SECURITY_TOKEN | Security token for temporary credentials (STS) |
Please be aware that when doing so in a distributed environment such as Ray, Daft will pick these credentials up from worker machines and thus each worker machine needs to be appropriately provisioned.
If instead you wish to have Daft use credentials from the "driver", you may wish to manually specify your credentials.
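Because every worker must be provisioned, a small pre-flight check run on each machine can catch missing variables early. The sketch below is not part of Daft: the function name missing_cos_settings is hypothetical, and treating a location (endpoint or region), a SecretId, and a SecretKey as the required set is an assumption made for illustration.

```python
import os
from typing import Mapping

# Interchangeable variable names, taken from the table above.
# Assumption: a location plus SecretId and SecretKey are treated as required here.
REQUIRED_SETTINGS = {
    "location": ("COS_ENDPOINT", "COS_REGION", "TENCENTCLOUD_REGION"),
    "secret_id": ("COS_SECRET_ID", "TENCENTCLOUD_SECRET_ID"),
    "secret_key": ("COS_SECRET_KEY", "TENCENTCLOUD_SECRET_KEY"),
}


def missing_cos_settings(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the settings for which none of the documented variables is set."""
    return [
        name
        for name, candidates in REQUIRED_SETTINGS.items()
        if not any(environ.get(var) for var in candidates)
    ]


if __name__ == "__main__":
    missing = missing_cos_settings()
    if missing:
        print("Missing COS settings on this machine:", ", ".join(missing))
```

Running this on each worker before starting a job gives a quick signal that the environment is incomplete.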
Manually specify credentials
You may also choose to pass these values into your Daft I/O function calls using a daft.io.CosConfig config object.
daft.set_planning_config is a convenient way to set your daft.io.IOConfig as the default config to use on any subsequent Daft method calls.
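As a rough sketch of this pattern, the code below reads credentials on the driver and installs them as the default config. The helper names are hypothetical, only the COS_-prefixed variables are read for brevity, and passing the COS settings to IOConfig through a cos= keyword is an assumption; check the IOConfig signature in your Daft version.

```python
import os
from typing import Mapping


def driver_cos_kwargs(environ: Mapping[str, str] = os.environ) -> dict:
    """Collect CosConfig keyword arguments from the driver's environment."""
    return {
        "region": environ["COS_REGION"],
        "secret_id": environ["COS_SECRET_ID"],
        "secret_key": environ["COS_SECRET_KEY"],
    }


def use_cos_by_default(cos_kwargs: dict):
    """Make a COS-enabled IOConfig the default for subsequent Daft calls."""
    # Imported lazily so the helper above stays usable without Daft installed.
    import daft
    from daft.io import CosConfig, IOConfig

    # Assumption: IOConfig accepts the COS settings under a `cos=` keyword.
    io_config = IOConfig(cos=CosConfig(**cos_kwargs))
    daft.set_planning_config(default_io_config=io_config)
    return io_config
```

After calling use_cos_by_default(driver_cos_kwargs()) on the driver, later reads of cos:// paths would use these credentials without any further arguments.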
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the io_config= keyword argument. This is extremely flexible as you can pass a different daft.io.CosConfig per function call if you wish!
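A per-operation override might look like the following sketch, which reads two datasets with two different sets of credentials. The function name and its arguments are hypothetical, and the cos= keyword on IOConfig is an assumption rather than a confirmed signature.

```python
def read_from_two_accounts(parquet_path: str, csv_path: str, cos_a: dict, cos_b: dict):
    """Read two COS datasets, each with its own CosConfig."""
    import daft
    from daft.io import CosConfig, IOConfig

    # Assumption: IOConfig accepts the COS settings under a `cos=` keyword.
    config_a = IOConfig(cos=CosConfig(**cos_a))
    config_b = IOConfig(cos=CosConfig(**cos_b))

    # Each call receives its own io_config, overriding any default config.
    df_a = daft.read_parquet(parquet_path, io_config=config_a)
    df_b = daft.read_csv(csv_path, io_config=config_b)
    return df_a, df_b
```

Here cos_a and cos_b are plain dictionaries of CosConfig options such as region, secret_id, and secret_key.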
Writing Data
Daft supports writing data to COS using the same CosConfig.
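A write could be sketched as below. The helper name write_dataframe_to_cos is hypothetical, the scheme check is an extra guard added for illustration, and the cos= keyword on IOConfig is again an assumption.

```python
def write_dataframe_to_cos(df, destination: str, cos_kwargs: dict):
    """Write a Daft DataFrame as Parquet files under a COS prefix."""
    # Guard against accidentally writing to a non-COS location.
    if not destination.startswith(("cos://", "cosn://")):
        raise ValueError(f"expected a cos:// or cosn:// destination, got {destination!r}")

    from daft.io import CosConfig, IOConfig

    # Assumption: IOConfig accepts the COS settings under a `cos=` keyword.
    io_config = IOConfig(cos=CosConfig(**cos_kwargs))
    return df.write_parquet(destination, io_config=io_config)
```

The same io_config= argument applies to write_csv and write_json.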
Configuration Options
The daft.io.CosConfig object supports the following options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| region | str | None | Region name, e.g. "ap-guangzhou", "ap-beijing", "ap-shanghai" |
| endpoint | str | None | Custom endpoint URL, e.g. "https://cos.ap-guangzhou.myqcloud.com". If not provided, it will be derived from the region. |
| secret_id | str | None | Tencent Cloud SecretId |
| secret_key | str | None | Tencent Cloud SecretKey |
| security_token | str | None | Security token for temporary credentials (STS) |
| anonymous | bool | False | Whether to use anonymous access (for public buckets) |
| max_retries | int | 3 | Maximum number of retries for failed requests |
| retry_timeout_ms | int | 30000 | Timeout duration for retry attempts in milliseconds |
| connect_timeout_ms | int | 10000 | Connection timeout in milliseconds |
| read_timeout_ms | int | 30000 | Read timeout in milliseconds |
| max_concurrent_requests | int | 50 | Maximum number of concurrent requests |
| max_connections | int | 50 | Maximum number of connections per IO thread |
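When options come from a config file or command-line flags, it can help to check them against the table before building a CosConfig. The sketch below is not part of Daft; the function name check_cos_options is hypothetical and the parameter names and types are copied from the table above.

```python
# Parameter names and types as listed in the options table.
DOCUMENTED_OPTIONS = {
    "region": str,
    "endpoint": str,
    "secret_id": str,
    "secret_key": str,
    "security_token": str,
    "anonymous": bool,
    "max_retries": int,
    "retry_timeout_ms": int,
    "connect_timeout_ms": int,
    "read_timeout_ms": int,
    "max_concurrent_requests": int,
    "max_connections": int,
}


def check_cos_options(options: dict) -> dict:
    """Raise on unknown option names or mismatched types, else return options."""
    for name, value in options.items():
        expected = DOCUMENTED_OPTIONS.get(name)
        if expected is None:
            raise KeyError(f"unknown CosConfig option: {name}")
        if value is None:
            continue  # None is the documented default for the string options
        # bool is a subclass of int, so reject it explicitly for int options.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            raise TypeError(f"{name} should be {expected.__name__}, got {type(value).__name__}")
    return options


# Example: anonymous access to a public bucket with more generous retries.
public_bucket_options = check_cos_options(
    {"region": "ap-guangzhou", "anonymous": True, "max_retries": 5}
)
```

The validated dictionary can then be unpacked into daft.io.CosConfig as keyword arguments.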
Region and Endpoint Auto-Derivation
You only need to specify either region or endpoint; Daft automatically derives the other:
- If only region is provided, the endpoint is derived as https://cos.{region}.myqcloud.com
- If only endpoint is provided, the region is extracted from the endpoint URL
- If both are provided, both values are used as-is
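The three rules can be illustrated with a small standalone sketch. This is not Daft's implementation: the function name is hypothetical, and the regular expression assumes the endpoint follows the same https://cos.{region}.myqcloud.com pattern used for derivation, since the exact parsing Daft applies is not described here.

```python
import re
from typing import Optional, Tuple

# Assumption: endpoints from which a region can be extracted look like
# https://cos.{region}.myqcloud.com (optionally with a trailing slash).
_ENDPOINT_PATTERN = re.compile(r"^https?://cos\.([a-z0-9-]+)\.myqcloud\.com/?$")


def resolve_region_and_endpoint(
    region: Optional[str] = None, endpoint: Optional[str] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Apply the derivation rules and return (region, endpoint)."""
    if region and endpoint:
        # Both given: use the values as-is.
        return region, endpoint
    if region:
        # Only region given: derive the endpoint from it.
        return region, f"https://cos.{region}.myqcloud.com"
    if endpoint:
        # Only endpoint given: try to extract the region from the URL.
        match = _ENDPOINT_PATTERN.match(endpoint)
        return (match.group(1) if match else None), endpoint
    return None, None


print(resolve_region_and_endpoint(region="ap-guangzhou"))
print(resolve_region_and_endpoint(endpoint="https://cos.ap-beijing.myqcloud.com"))
```

A custom endpoint that does not match the pattern yields no region in this sketch, which is one reason to pass both values explicitly in that case.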
Supported Operations
Daft supports the following operations with COS:
- Read: read_parquet, read_csv, read_json, and other file readers
- Write: write_parquet, write_csv, write_json (including multipart uploads)
- List: Listing objects with glob pattern matching
- Delete: Deleting objects