Load Data Using the Platform Extension Framework (PXF)
The Apache Cloudberry Platform Extension Framework (PXF) is an extensible framework that allows Apache Cloudberry to query external data whose metadata is not managed by the database. You can use its pxf:// protocol to access data residing in object store systems (Azure, Google Cloud Storage, MinIO, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.
PXF includes built-in connectors for accessing data that exists inside HDFS files, Hive tables, HBase tables, and JDBC-accessible databases. Users can also create their own connectors to other data storage or processing engines.
The version of PXF used in Apache Cloudberry is forked from the Greenplum PXF project and has been specifically adapted for it. The source code for this adapted version is hosted by the Apache Software Foundation at apache/cloudberry-pxf.
The pxf protocol is packaged as an Apache Cloudberry extension that supports both reading from and writing to external data stores. The framework includes a C-language extension and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Apache Cloudberry segment host. This long-running process concurrently serves multiple query requests.
Before using the pxf protocol, you must explicitly initialize and start the PXF service. You must also enable PXF in each database where it will be used and grant permissions on the pxf protocol to the relevant Apache Cloudberry users.
Install and Build PXF
PXF is a component that must be built and installed separately. The source code and detailed instructions for building, installing, and developing PXF are available in the official apache/cloudberry-pxf repository. Before proceeding with the configuration, follow the development guide in the repository to set up PXF.
Build and Installation Guide: apache/cloudberry-pxf README
Configure PXF
Before you can use PXF to access external data, you must configure and initialize the service.
Initialize and Start PXF
To use PXF, you must first initialize it. This process creates the necessary configuration directory structure and files on each Apache Cloudberry segment host. After initialization, you need to start the PXF service.
1. Add the PXF binary directory to your PATH. This step ensures that you can run pxf commands from any location.

   export PATH=/usr/local/pxf/bin:$PATH

   You can add this line to your .bashrc or .zshrc file to make it permanent.

2. Initialize PXF. The prepare command sets up the PXF configuration. This command only needs to be run once.

   pxf prepare

3. Start the PXF service. This command starts a Java process on each segment host that acts as the external data coordinator. You can verify that the service is running as shown below.

   pxf start
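To confirm that the service came up, check its status with the pxf CLI; pxf status reports on the local host, and pxf cluster status (used again in the Troubleshooting section below) runs the same check across all hosts:

pxf status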
Enable PXF in a database
You must enable PXF in each database in which you want to use the pxf protocol to access external data.
To enable PXF in a database, you must create the PXF extension in the database:
CREATE EXTENSION pxf;
The pxf protocol is packaged as an extension with Apache Cloudberry, and you must explicitly enable it in each database in which you plan to use it.
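You can verify that the extension is installed in the current database by querying the standard PostgreSQL catalog:

SELECT extname, extversion FROM pg_extension WHERE extname = 'pxf';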
Grant access to PXF
To allow non-superuser Apache Cloudberry roles to create readable external tables using the pxf protocol, you must grant SELECT privileges on the pxf protocol to each role:
GRANT SELECT ON PROTOCOL pxf TO <role_name>;
To allow non-superuser Apache Cloudberry roles to create writable external tables using the pxf protocol, you must grant INSERT privileges on the pxf protocol to each role:
GRANT INSERT ON PROTOCOL pxf TO <role_name>;
PXF configuration overview
PXF configuration includes server configuration and connector configuration. A PXF server is a named configuration that provides access credentials and other information required to access an external data source. A PXF connector is the interface between PXF and the external data source.
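In recent PXF releases, each named server is a directory under $PXF_BASE/servers/ (where $PXF_BASE is the PXF runtime directory), containing the connector's site files; the directory name is the value you pass in the SERVER option. A sketch of the layout assumed by the examples below:

$PXF_BASE/servers/
├── default/              # used when no SERVER option is specified
├── s3-server/
│   └── s3-site.xml       # S3 connector configuration
└── postgres-server/
    └── jdbc-site.xml     # JDBC connector configuration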
Example: configure S3 connector
Create a PXF configuration for accessing S3 data:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.amazonaws.com</value>
    </property>
</configuration>
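To use this configuration, save it as s3-site.xml in the named server directory referenced by the later examples (s3-server), then push it to all hosts. A minimal sketch, assuming a default $PXF_BASE location:

mkdir -p $PXF_BASE/servers/s3-server
cp s3-site.xml $PXF_BASE/servers/s3-server/
pxf cluster sync    # synchronize the configuration to all hosts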
Example: configure JDBC connector
For accessing SQL databases:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:postgresql://hostname:5432/database</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>username</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>password</value>
    </property>
</configuration>
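The JDBC connector also needs the database driver JAR on the PXF classpath. A sketch for the postgres-server configuration used in the examples; the driver JAR file name below is hypothetical:

mkdir -p $PXF_BASE/servers/postgres-server
cp jdbc-site.xml $PXF_BASE/servers/postgres-server/
cp postgresql-driver.jar $PXF_BASE/lib/   # hypothetical JAR name; use your actual driver
pxf cluster sync       # push the configuration to all hosts
pxf cluster restart    # restart so PXF picks up the new JAR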
Create PXF external tables
PXF external tables use the pxf:// protocol in the LOCATION clause. The URL format varies depending on the data source and connector.
Read from HDFS
Access text files stored in HDFS:
CREATE EXTERNAL TABLE sales_hdfs (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('pxf://hdfs-namenode:8020/data/sales/sales.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|');
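Once created, the external table is queried like any regular table; each query streams the external file through the PXF service. For example:

SELECT product_name, sum(amount) AS total_sales
FROM sales_hdfs
WHERE sale_date >= '2024-01-01'
GROUP BY product_name;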
Read from Hive tables
Access Hive tables directly:
CREATE EXTERNAL TABLE hive_sales (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('pxf://hive-metastore:9083/sales_db.sales_table?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Read from Parquet files
Access Parquet files in object storage:
CREATE EXTERNAL TABLE parquet_data (
id bigint,
name text,
created_date date
)
LOCATION ('pxf://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Read from SQL databases
Query external SQL databases:
CREATE EXTERNAL TABLE external_customers (
customer_id int,
customer_name text,
email text,
registration_date date
)
LOCATION ('pxf://customers?PROFILE=jdbc&SERVER=postgres-server')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Write data with PXF
Create writable external tables to export data:
CREATE WRITABLE EXTERNAL TABLE export_to_hdfs (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('pxf://hdfs-namenode:8020/exports/sales_export?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (transaction_id);
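Writable external tables support INSERT only; you cannot SELECT from them, and PXF does not update or delete external data. For example, to export rows from a hypothetical local sales table:

INSERT INTO export_to_hdfs
SELECT transaction_id, product_name, sale_date, amount
FROM sales;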
PXF URL format
The PXF protocol URL follows this syntax:
pxf://<host>[:<port>]/<path-to-data>?PROFILE=<profile_name>[&<custom-option>=<value>][&SERVER=<server_name>]
Where:
- host:port: the location of the external data source
- path-to-data: the path to the specific data (a file, directory, table, and so on)
- PROFILE: the PXF connector profile to use (for example, hdfs:text, hive, s3:parquet, jdbc)
- SERVER: the named server configuration to use (optional)
- custom-option: an additional connector-specific option
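As a concrete example, the Parquet-on-S3 table defined earlier decomposes as follows:

pxf://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server

my-bucket              -- host (here, the S3 bucket name)
data/events.parquet    -- path-to-data
PROFILE=s3:parquet     -- connector profile
SERVER=s3-server       -- named server configuration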
Common PXF profiles
| Profile | Data Source | Format | Use Case |
|---|---|---|---|
| hdfs:text | HDFS | Text files | Delimited text data |
| hdfs:avro | HDFS | Avro files | Schema evolution support |
| hdfs:parquet | HDFS | Parquet files | Columnar analytics |
| hdfs:orc | HDFS | ORC files | Optimized row columnar storage |
| hive | Hive | Various | Hive table access |
| hbase | HBase | HBase tables | NoSQL data access |
| s3:text | S3 | Text files | Cloud object storage |
| s3:parquet | S3 | Parquet files | Cloud analytics |
| jdbc | SQL database | Various | External database access |
Performance optimization
Partition pruning
PXF supports partition pruning for Hive tables:
SELECT * FROM hive_sales
WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';
Predicate pushdown
Enable predicate pushdown for better performance:
-- This filter can be pushed down to the external source
SELECT * FROM external_customers
WHERE registration_date > '2024-01-01';
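You can inspect the query plan with EXPLAIN to see the filter attached to the external scan; whether PXF actually pushes the predicate down to the source depends on the profile and the operators involved:

EXPLAIN SELECT * FROM external_customers
WHERE registration_date > '2024-01-01';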
Parallel processing
Leverage multiple files for better parallelism:
CREATE EXTERNAL TABLE multi_file_data (
id bigint,
data text
)
LOCATION ('pxf://hdfs-namenode:8020/data/partitioned/*?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');
Error handling
Enable error logging for data quality monitoring:
CREATE EXTERNAL TABLE sales_with_errors (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('pxf://hdfs-namenode:8020/data/sales/?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS SEGMENT REJECT LIMIT 50;
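Cloudberry inherits Greenplum's single-row error handling; assuming that machinery carries over to your build, rows rejected under LOG ERRORS can be inspected with the built-in error-log function:

SELECT * FROM gp_read_error_log('sales_with_errors');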
Best practices
- Select the most specific PXF profile that matches your data format and storage system to ensure optimal performance and feature support.
- Create reusable named server configurations to centralize connection parameters and credentials, making it easier to manage multiple external data sources.
- Structure your external data using file sizes between 100MB and 1GB to achieve optimal parallel processing across Apache Cloudberry segments.
- Leverage compressed storage formats such as Parquet or ORC to reduce I/O overhead and improve query performance.
- Implement comprehensive error logging and monitoring in production environments to quickly identify and resolve data access issues.
- Establish proper authentication mechanisms and enable encryption for all connections to sensitive external data sources.
Limitations
- PXF requires the external data source to be accessible from all segment hosts.
- Some data sources may have specific version compatibility requirements.
- Write operations depend on the connector's capabilities.
- Performance can be affected by network latency to external sources.
Troubleshooting
Check PXF service status
pxf cluster status
View PXF logs
PXF writes service logs on each host under its runtime directory (by default $PXF_BASE/logs):
tail -n 100 $PXF_BASE/logs/pxf-service.log
Test connectivity
SELECT * FROM pxf_external_table LIMIT 5;
Learn More
For more details about the Apache Cloudberry Platform Extension Framework (PXF), please refer to the official PXF project repository, which contains the source code, documentation, and contribution guidelines.