Reading from Apache Gravitino#
Apache Gravitino is an open-source data catalog that provides unified metadata management for various data sources and storage systems. Gravitino users can work with data assets such as tables (Iceberg, Hive, etc.) and filesets (raw files stored on S3, GCS, Azure Blob Storage, etc.).
To use Daft with Gravitino, you will need to install Daft with the gravitino option specified like so:
    pip install -U "daft[gravitino]"
Warning
These APIs are in beta and may be subject to change as the Gravitino connector continues to be developed.
Features#
- Catalog Navigation: List catalogs, schemas, and tables
- Multi-Format Tables: Read Iceberg and Hive/Parquet tables via catalog.get_table("...").read()
- Table Management: Load existing tables or create new external tables
- Fileset Support: Access Gravitino filesets for file storage
- GVFS Protocol: Read and write files using gvfs:// URLs for seamless fileset access
- Authentication: Supports simple and OAuth2 authentication methods
- Daft Catalog Integration: Integration with Daft's catalog system via Catalog.from_gravitino()
Connecting to Gravitino#
Configuration#
Authentication#
Catalog.from_gravitino supports two authentication methods:
- Simple Authentication: Uses username/password or just username
- OAuth2: Uses bearer token authentication
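To make the difference between the two methods concrete, here is a minimal sketch of how each might translate into an HTTP `Authorization` header on a REST request. This is not Daft's implementation: the function name `build_auth_header` and its parameters are hypothetical, and sending simple authentication as HTTP Basic credentials is an assumption.

```python
import base64


def build_auth_header(auth_type, username=None, password=None, token=None):
    """Return HTTP headers for a REST request (illustrative helper, not Daft API)."""
    if auth_type == "simple":
        if not username:
            raise ValueError("simple authentication requires a username")
        # Assumption: simple auth is sent as HTTP Basic credentials, with an
        # empty password when only a username is supplied.
        raw = f"{username}:{password or ''}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if auth_type == "oauth2":
        if not token:
            raise ValueError("oauth2 authentication requires a bearer token")
        # OAuth2 uses a bearer token in the Authorization header.
        return {"Authorization": f"Bearer {token}"}
    raise ValueError(f"unsupported auth_type: {auth_type!r}")
```

Failing early on a missing username or token keeps configuration mistakes from surfacing later as opaque HTTP 401 responses.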
Storage Credentials#
Gravitino manages storage credentials through table and fileset properties. The client automatically extracts and configures:
- S3: Access key, secret key, and session token
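As a rough illustration of this extraction step, the sketch below pulls the three S3 fields out of a properties dictionary. The property key names and the helper `extract_s3_credentials` are hypothetical, chosen only for the example; check the properties your Gravitino server actually returns.

```python
# Hypothetical property keys; the names used by a real Gravitino server may differ.
S3_PROPERTY_KEYS = {
    "key_id": "s3-access-key-id",
    "access_key": "s3-secret-access-key",
    "session_token": "s3-session-token",
}


def extract_s3_credentials(properties):
    """Pick S3 credential fields out of a table or fileset properties dict."""
    creds = {}
    for field, prop_name in S3_PROPERTY_KEYS.items():
        value = properties.get(prop_name)
        if value:  # skip missing or empty values
            creds[field] = value
    return creds
```

The output keys (`key_id`, `access_key`, `session_token`) are plain dictionary keys here; mapping them onto a storage configuration object is left to the caller.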
GVFS Protocol Support#
Daft supports reading and writing files directly from Gravitino filesets using the gvfs:// protocol. This provides a unified interface for accessing files stored in various cloud storage systems through Gravitino's metadata management.
GVFS URL Format#
GVFS URLs follow this format:
    gvfs://fileset/<catalog>/<schema>/<fileset>/<path>
Where:
- <catalog>: Name of the Gravitino catalog
- <schema>: Name of the schema within the catalog
- <fileset>: Name of the fileset
- <path>: Optional path to specific files within the fileset
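The components above can be split out of a URL with the standard library. This is a sketch only, assuming URLs of the form `gvfs://fileset/<catalog>/<schema>/<fileset>/<path>` (the literal `fileset` authority follows Gravitino's virtual file system convention and is an assumption here); `GvfsPath` and `parse_gvfs_url` are hypothetical names, not part of Daft.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class GvfsPath(NamedTuple):
    catalog: str
    schema: str
    fileset: str
    path: str  # empty string when the URL points at the fileset root


def parse_gvfs_url(url):
    """Split a gvfs:// URL into catalog, schema, fileset, and optional path."""
    parsed = urlparse(url)
    if parsed.scheme != "gvfs":
        raise ValueError(f"not a gvfs URL: {url!r}")
    # Assumption: the authority component is the literal string "fileset".
    if parsed.netloc != "fileset":
        raise ValueError(f"expected gvfs://fileset/..., got: {url!r}")
    # Split into at most four pieces so the trailing path keeps its slashes.
    parts = parsed.path.lstrip("/").split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"catalog, schema, and fileset are required: {url!r}")
    path = parts[3] if len(parts) == 4 else ""
    return GvfsPath(parts[0], parts[1], parts[2], path)
```

For example, `parse_gvfs_url("gvfs://fileset/cat/sch/fs/data/file.parquet")` yields the catalog `cat`, schema `sch`, fileset `fs`, and path `data/file.parquet`.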
Reading Files with GVFS#
Writing Files with GVFS#
GVFS Benefits#
- Unified Access: Use the same URL format for reading and writing
- Storage Abstraction: Access files without knowing underlying storage details (S3, GCS, etc.)
- Metadata Integration: Leverage Gravitino's catalog metadata for data discovery
- Credential Management: Gravitino handles storage credentials automatically
- Multi-format Support: Works with Parquet, CSV, JSON, and other file formats
API Reference#
Catalog.from_gravitino(...)#
Creates a Daft Catalog from a Gravitino metalake.
Table.from_gravitino(table)#
Creates a Daft Table from a GravitinoTable.
Requirements#
- Apache Gravitino server (0.9.0+)
- Python requests library
- Appropriate cloud storage credentials configured in Gravitino
Compatibility#
This integration supports both legacy and current Gravitino API formats:
- Legacy format (pre-1.0): Storage location in properties.location
- Current format (1.0+): Multiple storage locations in storageLocations with configurable default
The client automatically detects and handles both formats for seamless compatibility.
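A minimal sketch of what such format detection could look like is shown below. It is not the connector's actual code: it assumes `storageLocations` is a mapping from location name to URI, and that the default name is stored under a `default-location-name` property (falling back to `"default"`); both details are assumptions, as is the helper name `resolve_storage_location`.

```python
def resolve_storage_location(fileset):
    """Return the storage location from a fileset metadata dict (either format)."""
    properties = fileset.get("properties") or {}
    locations = fileset.get("storageLocations") or {}
    if locations:
        # Current format: pick the configured default location.
        # Assumption: the default name lives in this property key.
        default_name = properties.get("default-location-name", "default")
        if default_name in locations:
            return locations[default_name]
        # No matching default: fall back to the first listed location.
        return next(iter(locations.values()))
    # Legacy format: a single location stored in properties.location.
    legacy = properties.get("location")
    if legacy:
        return legacy
    raise ValueError("fileset metadata has no storage location")
```

Checking the current format first means a response that carries both fields resolves to the configured default rather than the legacy value.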
Limitations#
- Credential vending is not yet implemented
- This version calls the Gravitino REST API directly rather than using the Gravitino Python client
- GVFS write support currently works with S3-backed filesets (other storage backends coming soon)
- Some advanced Gravitino features may not be exposed through this client
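Because writes are limited to S3-backed filesets for now, it can help to check a fileset's storage location before attempting one. The guard below is a sketch under the assumption that S3-backed locations use the `s3` or `s3a` URI scheme; `is_s3_backed` is a hypothetical helper, not part of Daft.

```python
from urllib.parse import urlparse

# Assumption: S3-backed storage locations use one of these URI schemes.
S3_SCHEMES = {"s3", "s3a"}


def is_s3_backed(storage_location):
    """Return True if the storage location URI points at S3."""
    return urlparse(storage_location).scheme.lower() in S3_SCHEMES
```

A caller could raise a clear error when this returns `False` instead of letting the write fail partway through.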
Roadmap#
- Support for reading and writing Iceberg tables from Gravitino ✓
- Support for Hive/Parquet tables ✓
- Support for additional table formats (Hudi)
- Support for more storage backends (GCS, Azure ADLS, OSS, etc.)
- Support for credential vending
Please open an issue on the Daft repository or the Gravitino repository if you have a use case that the Daft Gravitino connector does not currently cover!