Load Data Securely Using gpfdists
The gpfdists protocol is a secure version of the gpfdist protocol that enables encrypted communication between Apache Cloudberry and the gpfdist file server. When you use gpfdists, all data transfer is encrypted using SSL, protecting against eavesdropping and man-in-the-middle attacks.
gpfdists provides the same high-performance parallel data loading capabilities as gpfdist, but with additional security features essential for production environments handling sensitive data.
Security features
- All data transmitted between Apache Cloudberry segments and gpfdist servers is encrypted using SSL/TLS protocols, protecting against eavesdropping and data interception.
- Mutual authentication is enforced through client certificates, ensuring that both Apache Cloudberry and gpfdist servers verify each other's identities before establishing connections.
- The implementation uses the TLSv1 protocol with the AES_128_CBC_SHA encryption algorithm to protect data in transit.
- Secure server identification mechanisms prevent unauthorized systems from masquerading as legitimate gpfdist servers, protecting against man-in-the-middle attacks.
Before you begin
To use gpfdists, make sure the following are in place:
- SSL certificates are configured on all segment hosts.
- The gpfdist utility is available on the file server host.
- Segment hosts have network connectivity to the gpfdist server.
- The SSL certificate files are in the correct locations.
Step 1. Set up SSL certificates
Required certificate files
The following certificate files must be present in the $PGDATA/gpfdists directory on each Apache Cloudberry segment host:
For full SSL authentication (recommended):
- client.key: Client private key file
- client.crt: Client certificate file
- root.crt: Trusted certificate authorities file
Certificate requirements by configuration:
verify_gpfdists_cert | --ssl_verify_peer | Required Certificate Files |
---|---|---|
on (default) | on (default) | client.key, client.crt, root.crt |
on | off | root.crt |
off | on | client.key, client.crt |
off | off | None |
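The table above can be turned into a small pre-flight check. The following is a minimal sketch, not part of Apache Cloudberry: required_cert_files and missing_cert_files are hypothetical helper names, and the mapping is taken directly from the table (client.key and client.crt go with --ssl_verify_peer on, root.crt goes with verify_gpfdists_cert on).

```shell
# Hypothetical helper: prints the certificate files that the table lists
# for a pair of settings. $1 = verify_gpfdists_cert, $2 = --ssl_verify_peer,
# each "on" or "off" (both default to "on").
required_cert_files() {
  rcf_files=""
  # Client key and certificate are listed whenever --ssl_verify_peer is on.
  if [ "${2:-on}" = "on" ]; then
    rcf_files="client.key client.crt"
  fi
  # root.crt is listed whenever verify_gpfdists_cert is on.
  if [ "${1:-on}" = "on" ]; then
    rcf_files="${rcf_files:+$rcf_files }root.crt"
  fi
  echo "$rcf_files"
}

# Hypothetical helper: reports required files that are absent from a directory.
# $1 = directory, $2 = verify_gpfdists_cert, $3 = --ssl_verify_peer.
# Returns 0 when nothing is missing, 1 otherwise.
missing_cert_files() {
  mcf_status=0
  for mcf_file in $(required_cert_files "$2" "$3"); do
    if [ ! -f "$1/$mcf_file" ]; then
      echo "missing: $1/$mcf_file"
      mcf_status=1
    fi
  done
  return "$mcf_status"
}
```

For example, `missing_cert_files "$PGDATA/gpfdists" on on` would list any of the three files that are not yet in place on the current host.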
Install certificates
- Create the gpfdists directory on each segment host:
mkdir -p $PGDATA/gpfdists
- Copy the certificate files to each segment host:
# Copy to all segment hosts
scp client.key client.crt root.crt gpadmin@segment-host:$PGDATA/gpfdists/
- Set appropriate permissions:
chmod 600 $PGDATA/gpfdists/client.key
chmod 644 $PGDATA/gpfdists/client.crt
chmod 644 $PGDATA/gpfdists/root.crt
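When there are many segment hosts, one option is to print the commands for every host first and review them before running anything. The sketch below only prints text. It assumes SSH access as gpadmin and the same gpfdists directory path on every host, which may not hold in your cluster. print_install_commands is a hypothetical helper, and the host names and directory in the example call are made up.

```shell
# Hypothetical helper: prints (does not run) the install commands above
# for each segment host so they can be reviewed first.
# $1 = gpfdists directory on the remote hosts, remaining args = host names.
print_install_commands() {
  pic_dir="$1"
  shift
  for pic_host in "$@"; do
    echo "ssh gpadmin@$pic_host mkdir -p $pic_dir"
    echo "scp client.key client.crt root.crt gpadmin@$pic_host:$pic_dir/"
    echo "ssh gpadmin@$pic_host chmod 600 $pic_dir/client.key"
    echo "ssh gpadmin@$pic_host chmod 644 $pic_dir/client.crt $pic_dir/root.crt"
  done
}

# Example call with placeholder values (not real hosts or paths).
print_install_commands /data/primary/gpseg0/gpfdists sdw1 sdw2
```

Once the printed commands look right, they can be saved to a file and run with `sh`.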
Step 2. Start gpfdist with SSL
Start the gpfdist utility with the --ssl option to enable secure connections:
gpfdist -p 8081 -d /data/load_files --ssl /path/to/certificates &
SSL options for gpfdist
- --ssl <certificates_path>: Enables SSL and specifies the certificate directory.
- --ssl_verify_peer on|off: Controls peer verification (default: on).
Example: Start multiple secure gpfdist instances
# Starts the first secure gpfdist instance.
gpfdist -d /var/load_files1 -p 8081 --ssl /home/gpadmin/certs \
--ssl_verify_peer on -l /home/gpadmin/log1 &
# Starts the second secure gpfdist instance.
gpfdist -d /var/load_files2 -p 8082 --ssl /home/gpadmin/certs \
--ssl_verify_peer on -l /home/gpadmin/log2 &
Step 3. Create external tables with gpfdists
Use the gpfdists:// protocol in the LOCATION clause to create secure external tables:
Readable external table
CREATE EXTERNAL TABLE secure_sales_data (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdists://etl-server1:8081/sales/*.txt',
'gpfdists://etl-server2:8082/sales/*.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL ' ');
Writable external table
CREATE WRITABLE EXTERNAL TABLE secure_export (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdists://etl-server1:8081/exports/sales_data.txt')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (transaction_id);
With error handling
CREATE EXTERNAL TABLE secure_data_with_errors (
id int,
name text,
value decimal(10,2)
)
LOCATION ('gpfdists://etl-server:8081/data/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 100;
Configuration parameters
Apache Cloudberry parameters
Configure these parameters in postgresql.conf:
# Enables/disables SSL certificate verification (default: on)
verify_gpfdists_cert = on
# Control segment parallelism
gp_external_max_segs = 64
gpfdist SSL parameters
The gpfdist utility supports these SSL-related options:
Parameter | Description | Default |
---|---|---|
--ssl | Enable SSL and specify certificate path | Disabled |
--ssl_verify_peer | Verify client certificates | on |
Security best practices
Certificate management
- Generate all certificates from a trusted certificate authority to ensure proper validation and trust chain establishment.
- Implement a regular certificate rotation schedule to enhance security and prevent issues from certificate expiration.
- Store private keys in secure locations with restricted access permissions, ensuring only authorized personnel can access them.
- Maintain secure backup copies of all certificate files to enable quick recovery in case of system failures or corruption.
Network security
- Configure firewall rules to restrict access to gpfdists ports, allowing only authorized Apache Cloudberry segment hosts to connect.
- Use secure network connections such as VPNs or private networks to prevent unauthorized access to data transmission channels.
- Implement continuous monitoring of SSL connections and certificate expiration dates to proactively address security issues.
Access control
- Apply the principle of least privilege by granting only the minimum permissions necessary for users and applications to perform their required functions.
- Implement robust authentication mechanisms including multi-factor authentication where appropriate to verify user identities.
- Enable comprehensive audit logging to track all access attempts, successful connections, and security-related events for compliance and security monitoring.
Troubleshooting
SSL connection errors
Check certificate configuration:
# Verifies certificate files exist.
ls -la $PGDATA/gpfdists/
# Checks certificate validity.
openssl x509 -in $PGDATA/gpfdists/client.crt -text -noout
Troubleshoot certificate verification issues
- Verify that the root.crt file contains the correct certificate authority chain and that all intermediate certificates are properly included for validation.
- Check certificate expiration dates using tools like openssl x509 -dates to ensure that certificates have not expired and plan for renewal well in advance.
- Validate that the private key file client.key corresponds exactly to the public certificate in client.crt using certificate validation tools.
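The three checks above can be scripted with standard openssl subcommands. This is a sketch under assumptions: the function names are hypothetical, the private key is assumed to be unencrypted (otherwise openssl prompts for a passphrase), and comparing public keys is one common way to confirm that a key and certificate belong together, not necessarily the check that gpfdists performs internally.

```shell
# Returns 0 if the certificate ($2) verifies against the CA file ($1).
# Checks for the "OK" line rather than relying only on the exit status.
cert_chains_to_ca() {
  openssl verify -CAfile "$1" "$2" 2>/dev/null | grep -q ': OK$'
}

# Returns 0 if the certificate ($1) is still valid $2 days from now.
# openssl x509 -checkend takes a number of seconds.
cert_valid_for_days() {
  openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Returns 0 if the private key ($1) and the certificate ($2) carry the
# same public key.
key_matches_cert() {
  kmc_key_pub=$(openssl pkey -in "$1" -pubout 2>/dev/null) || return 1
  kmc_crt_pub=$(openssl x509 -in "$2" -noout -pubkey 2>/dev/null) || return 1
  [ -n "$kmc_key_pub" ] && [ "$kmc_key_pub" = "$kmc_crt_pub" ]
}
```

A typical run on a segment host might be `cert_chains_to_ca "$PGDATA/gpfdists/root.crt" "$PGDATA/gpfdists/client.crt"`, followed by `cert_valid_for_days "$PGDATA/gpfdists/client.crt" 30` and `key_matches_cert "$PGDATA/gpfdists/client.key" "$PGDATA/gpfdists/client.crt"`. Each returns a non-zero status when its check fails.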
Common error messages
Error | Cause | Solution |
---|---|---|
"SSL certificate verify failed" | Invalid or expired certificate | Check certificate validity and CA |
"SSL handshake failed" | SSL configuration mismatch | Verify SSL settings on both sides |
"Permission denied" | Incorrect file permissions | Set proper permissions on certificate files |
Debug SSL connections
Enable verbose logging:
gpfdist -d /data -p 8081 --ssl /certs -V --ssl_verify_peer on
Performance considerations
- SSL encryption introduces computational overhead during data transmission, which may reduce overall throughput compared to unencrypted connections, especially for large data transfers.
- Certificate caching mechanisms help reduce the performance impact of SSL handshakes by reusing established secure connections across multiple data transfer operations.
- Deploy multiple gpfdists instances across different hosts or network interfaces to distribute the load and achieve better aggregate throughput for concurrent data operations.
- Ensure that your network infrastructure provides adequate bandwidth and low latency between Apache Cloudberry segments and gpfdist servers to minimize performance bottlenecks.
Migration from gpfdist to gpfdists
To migrate existing gpfdist usage to gpfdists:
- Install and configure the required SSL certificates on all Apache Cloudberry segment hosts, ensuring proper permissions and certificate chain validation.
- Update all external table definitions to change protocol specifications from gpfdist:// to gpfdists:// in their LOCATION clauses.
- Restart your gpfdist file servers with the --ssl option enabled, specifying the appropriate certificate directories and SSL verification settings.
- Thoroughly test secure connections from all Apache Cloudberry segments to verify that data loading operations work correctly with the new encrypted protocol.
- Update any automation scripts, monitoring tools, and operational procedures to account for the new secure protocol requirements and certificate management tasks.
Example migration
Before (insecure):
LOCATION ('gpfdist://etl-server:8081/data.txt')
After (secure):
LOCATION ('gpfdists://etl-server:8081/data.txt')
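If the external table definitions are kept in DDL script files, the protocol change can be applied with a text substitution. The sketch below assumes exactly that; to_gpfdists is a hypothetical helper name. It only rewrites text, so applying the updated definitions to the database is a separate step that depends on your environment.

```shell
# Hypothetical helper: rewrites gpfdist:// URLs to gpfdists:// in DDL text
# read from standard input. URLs that already use gpfdists:// are left
# unchanged because they do not contain the string "gpfdist://".
to_gpfdists() {
  sed 's#gpfdist://#gpfdists://#g'
}

# Example: convert the "before" LOCATION clause shown above.
echo "LOCATION ('gpfdist://etl-server:8081/data.txt')" | to_gpfdists
```

Running the function a second time on its own output makes no further changes, so it is safe to apply to files that are already partly migrated.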
Limitations
- Cannot mix gpfdist:// and gpfdists:// protocols in the same external table definition.
- All Apache Cloudberry segment hosts must have properly configured SSL certificates and trust relationships.
- SSL encryption introduces computational overhead that may reduce data transfer throughput compared to unencrypted gpfdist.
- Ongoing certificate lifecycle management is required, including renewal, rotation, and revocation processes.
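The first limitation can be checked before a definition is submitted. The following is a minimal sketch with a hypothetical helper name, has_mixed_protocols. It assumes that standard input holds a single CREATE EXTERNAL TABLE statement; with several statements in one input, it would also flag files where different tables use different protocols.

```shell
# Hypothetical helper: returns 0 if the text on standard input contains
# both gpfdist:// and gpfdists:// URLs, and non-zero otherwise.
has_mixed_protocols() {
  hmp_text=$(cat)
  printf '%s\n' "$hmp_text" | grep -q 'gpfdist://' &&
    printf '%s\n' "$hmp_text" | grep -q 'gpfdists://'
}
```

For example, `has_mixed_protocols < create_table.sql && echo "mixed protocols found"` prints a warning only when both URL schemes appear; create_table.sql is a placeholder file name.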