Reading from and Writing to PostgreSQL#
PostgreSQL is an open-source object-relational database system known for its reliability, feature robustness, and performance.
Installing Daft with PostgreSQL Support#
Install Daft with the daft[postgres] extra, or manually install the required packages: psycopg and pgvector for vector support.
1 | |
Tutorial#
Reading a Table#
To read from PostgreSQL tables, use Daft's Catalog API. Daft provides high-level APIs for connecting to PostgreSQL databases and accessing tables as DataFrames.
To connect to a PostgreSQL database, provide a connection string to Catalog.from_postgres() and a list of PostgreSQL extensions to install. By default, the pgvector vector extension is automatically installed if available for vector support.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | |
After connecting to your database, reading a table is extremely easy:
1 2 3 4 5 6 | |
Daft supports parallel reads by partitioning on a column:
1 2 3 | |
Writing to a Table#
To write DataFrames to PostgreSQL tables, you can create tables and use the append or overwrite operations.
1 2 3 4 5 6 7 8 9 10 11 12 13 | |
You can also overwrite existing tables:
1 2 3 4 5 6 7 8 9 | |
Type System#
| Daft Type | PostgreSQL Type |
|---|---|
daft.DataType.bool() | BOOLEAN |
daft.DataType.int8() | SMALLINT |
daft.DataType.int16() | SMALLINT |
daft.DataType.int32() | INTEGER |
daft.DataType.int64() | BIGINT |
daft.DataType.float32() | REAL |
daft.DataType.float64() | DOUBLE PRECISION |
daft.DataType.decimal128(precision, scale) | NUMERIC |
daft.DataType.string() | TEXT |
daft.DataType.binary() | BYTEA |
daft.DataType.date() | DATE |
daft.DataType.timestamp(timeunit="us", timezone=None) | TIMESTAMP |
daft.DataType.timestamp(timeunit="us", timezone="UTC") | TIMESTAMPTZ |
daft.DataType.list(element_type) | element_type[] |
daft.DataType.struct(fields) | JSONB |
daft.DataType.map(key_type, value_type) | JSONB |
daft.DataType.embedding(size) | VECTOR(size) |
Reference#
Daft provides high-level Session and Catalog APIs for reading and writing PostgreSQL tables. However, the Catalog.from_postgres and table operations are the primary entry points for PostgreSQL interactions.
Catalog Operations#
The PostgreSQL catalog supports the full range of catalog operations:
- Namespaces (Schemas):
create_namespace(),drop_namespace(),list_namespaces(),has_namespace() - Tables:
create_table(),create_table_if_not_exists(),drop_table(),get_table(),has_table(),list_tables(),read_table(),write_table()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | |
Table Operations#
PostgreSQL tables support read, append, and overwrite operations:
1 2 3 4 5 6 7 | |
FAQs#
-
Does Daft support PostgreSQL extensions?
Yes! Daft can automatically install and configure PostgreSQL extensions if they're available. The pgvector extension
vectoris installed by default, if available. -
How does Daft handle PostgreSQL data types?
Daft provides comprehensive type mapping between Daft's Arrow-based types and PostgreSQL types. Complex types like arrays and JSON are supported, with fallbacks to JSONB for unsupported structures.
-
Does Daft support PostgreSQL Row Level Security?
Yes, Daft automatically enables Row Level Security on newly created tables for enhanced security. This can be controlled via the
enable_rlsproperty when creating tables. -
How does Daft handle PostgreSQL schemas and search paths?
Daft respects PostgreSQL's schema search path. When no schema is specified, PostgreSQL uses its configured search path to resolve table names.
-
Can Daft work with PostgreSQL in distributed environments?
Yes, when using Daft's distributed runner, PostgreSQL writes can be distributed across multiple machines for maximum performance.
-
How does Daft handle vector embeddings in PostgreSQL?
Daft has native support for pgvector's VECTOR type. Embedding columns are automatically converted between Daft's embedding type and PostgreSQL's vector type during reads and writes.