Handle Data Loading Errors
Real-world data is often imperfect, containing formatting errors, missing values, or inconsistent data types. Apache Cloudberry provides robust error handling mechanisms that allow you to load correctly formatted data while isolating and managing problematic rows, ensuring your ETL processes are resilient and reliable.
By default, if external table data contains any error, the entire load operation fails and no data is loaded. With error handling enabled, you can load valid data and deal with problematic rows separately.
Error handling modes
Apache Cloudberry supports two error handling approaches:
- Single row error isolation allows the system to skip individual problematic rows and continue processing the remaining valid data, preventing entire operations from failing due to isolated data quality issues.
- Error logging functionality captures comprehensive details about problematic data rows, including error descriptions, line numbers, and the actual data that caused the failure, enabling thorough analysis and remediation.
Single row error isolation
Basic error isolation
Enable single row error isolation by adding a SEGMENT REJECT LIMIT clause to your external table definition:
CREATE EXTERNAL TABLE sales_data_with_errors (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
SEGMENT REJECT LIMIT 100;
This configuration allows up to 100 rows with errors per segment before the operation fails.
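To see the limit in action, load the readable rows into a regular table. The target table below, sales_data_clean, is a hypothetical name introduced for this sketch and is not part of the definition above.
-- Hypothetical target table with the same columns as the external table.
CREATE TABLE sales_data_clean (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
) DISTRIBUTED BY (transaction_id);
-- Malformed rows are skipped (up to 100 per segment); valid rows are loaded.
INSERT INTO sales_data_clean SELECT * FROM sales_data_with_errors;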
Percentage-based limits
You can also specify error limits as a percentage:
CREATE EXTERNAL TABLE sales_data (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
SEGMENT REJECT LIMIT 5 PERCENT;
This allows up to 5 percent of the rows processed on each segment to contain errors before the operation fails.
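Because the limit is enforced while the files are being parsed, any scan of the external table exercises it; a simple count is a convenient smoke test:
-- Scanning the external table parses the files and applies the reject limit;
-- if more than 5 percent of the rows on a segment are malformed, the scan aborts.
SELECT count(*) FROM sales_data;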
Error logging
Enable error logging
Use the LOG ERRORS clause to capture detailed information about rejected rows:
CREATE EXTERNAL TABLE sales_data_logged (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS
SEGMENT REJECT LIMIT 50;
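The error log is populated as rows are rejected during a scan, so a quick way to exercise it is to read the table and then inspect the log:
-- Reading the external table rejects and logs malformed rows (up to the limit).
SELECT count(*) FROM sales_data_logged;
-- Inspect what was rejected.
SELECT linenum, errmsg FROM gp_read_error_log('sales_data_logged');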
Persistent error logging
For long-term error analysis, use persistent error logging:
CREATE EXTERNAL TABLE sales_data_persistent (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS PERSISTENTLY
SEGMENT REJECT LIMIT 25;
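Persistently logged errors are kept even if the external table is later dropped and re-created. The sketch below assumes your build exposes the gp_read_persistent_error_log() function for reading them, as Greenplum-derived releases do; verify this in your release documentation.
-- Assumes gp_read_persistent_error_log() is available in your build.
SELECT cmdtime, filename, linenum, errmsg
FROM gp_read_persistent_error_log('sales_data_persistent');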
View error information
Query error logs
When error logging is enabled, Apache Cloudberry records rejected rows in an error log that you can query with the gp_read_error_log() function:
-- View the errors logged for the sales_data_logged external table.
SELECT * FROM gp_read_error_log('sales_data_logged');
Error log table structure
The error log contains these columns:
| Column | Description |
|---|---|
| cmdtime | Timestamp when the error occurred |
| relname | Name of the external table |
| filename | Source file containing the error |
| linenum | Line number in the source file |
| bytenum | Byte position in the source file |
| errmsg | Error message description |
| rawdata | Raw data that caused the error |
| rawbytes | Raw bytes of the problematic data |
Example error analysis
-- Find the most common error types
SELECT errmsg, COUNT(*) as error_count
FROM gp_read_error_log('sales_data_logged')
GROUP BY errmsg
ORDER BY error_count DESC;
-- View specific error details.
SELECT cmdtime, filename, linenum, errmsg, rawdata
FROM gp_read_error_log('sales_data_logged')
WHERE errmsg LIKE '%invalid input syntax%'
ORDER BY cmdtime DESC;
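A per-file breakdown is also useful for spotting a single bad feed:
-- Summarize errors by source file.
SELECT filename, COUNT(*) AS error_count
FROM gp_read_error_log('sales_data_logged')
GROUP BY filename
ORDER BY error_count DESC;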
Common data errors and solutions
Data type conversion errors
Error: invalid input syntax for type numeric
Cause: Non-numeric data in numeric columns
Solution: Clean data or use text columns with post-processing
-- Original problematic data: "N/A" in amount column.
-- Solution: Use text type and handle conversion later.
CREATE EXTERNAL TABLE sales_flexible (
transaction_id int,
product_name text,
sale_date text, -- Use text for flexible parsing
amount text -- Use text to handle "N/A" values
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 10 PERCENT;
Date format issues
Error: invalid input syntax for type date
Cause: Inconsistent date formats
Solution: Standardize date formats or use flexible parsing
-- Handle multiple date formats in post-processing.
SELECT
transaction_id,
product_name,
CASE
WHEN sale_date ~ '^\d{4}-\d{2}-\d{2}$' THEN sale_date::date
WHEN sale_date ~ '^\d{2}/\d{2}/\d{4}$' THEN to_date(sale_date, 'MM/DD/YYYY')
ELSE NULL
END as parsed_date,
NULLIF(amount, 'N/A')::decimal(10,2) AS amount -- treat "N/A" as NULL so the cast does not fail
FROM sales_flexible
WHERE sale_date IS NOT NULL;
Missing or extra columns
Error: extra data after last expected column
Cause: Inconsistent number of columns
Solution: Use more flexible table definition
-- Add optional trailing columns to absorb rows that carry extra fields.
CREATE EXTERNAL TABLE flexible_sales (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2),
extra_field1 text, -- Optional fields
extra_field2 text,
extra_field3 text
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
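The extra declared columns absorb rows that carry additional fields. For the opposite problem, rows with too few fields, the FILL MISSING FIELDS format option (supported for gpfdist text/CSV formats in Greenplum-derived systems; verify it in your release) sets the missing trailing values to NULL instead of rejecting the row:
-- Rows with fewer fields than declared get NULL for the missing trailing columns.
CREATE EXTERNAL TABLE sales_fill_missing (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER FILL MISSING FIELDS)
LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;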
Character encoding issues
Error: invalid byte sequence
Cause: Character encoding mismatch
Solution: Specify correct encoding
CREATE EXTERNAL TABLE encoded_data (
id int,
description text
)
LOCATION ('gpfdist://etl-server:8081/data/*.txt')
FORMAT 'TEXT' (DELIMITER '|')
ENCODING 'LATIN1' -- Specify encoding
LOG ERRORS SEGMENT REJECT LIMIT 5 PERCENT;
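If you are unsure which encoding to declare, check the database's own encoding first; the ENCODING clause describes the source file, and the data is converted into the database encoding during the load.
-- Show the database encoding that external data is converted into.
SHOW server_encoding;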
Error handling strategies
Two-phase loading
Use a staging approach for complex data cleaning:
-- Phase 1: Load into staging table with flexible types
CREATE EXTERNAL TABLE sales_staging (
transaction_id text,
product_name text,
sale_date text,
amount text
-- Include a raw_line text column only if the source files actually contain
-- a matching fifth field; otherwise every row would be rejected.
)
LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
-- Phase 2: Clean and insert into final table
INSERT INTO sales_final (transaction_id, product_name, sale_date, amount)
SELECT
transaction_id::int,
product_name,
sale_date::date,
amount::decimal(10,2)
FROM sales_staging
WHERE transaction_id ~ '^\d+$' -- Validate numeric ID
AND sale_date ~ '^\d{4}-\d{2}-\d{2}$' -- Validate date format
AND amount ~ '^\d+\.?\d*$'; -- Validate amount format
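You can also keep the rows that fail validation for later review; sales_rejected below is a hypothetical table name.
-- Capture rows that fail validation (re-reads the source files).
-- Note: rows where these columns are NULL are not matched by the regex tests
-- and need separate handling.
CREATE TABLE sales_rejected AS
SELECT *
FROM sales_staging
WHERE transaction_id !~ '^\d+$'
OR sale_date !~ '^\d{4}-\d{2}-\d{2}$'
OR amount !~ '^\d+\.?\d*$';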
Error threshold monitoring
Set up monitoring for error rates:
-- Function to check error rate
CREATE OR REPLACE FUNCTION check_error_rate(table_name text, threshold_percent numeric)
RETURNS boolean AS $$
DECLARE
error_count int;
total_count int;
error_rate numeric;
BEGIN
SELECT COUNT(*) INTO error_count FROM gp_read_error_log(table_name);
-- Estimate total processed rows (depends on your tracking method)
SELECT reltuples INTO total_count FROM pg_class WHERE relname = table_name;
IF total_count > 0 THEN
error_rate := (error_count::numeric / total_count) * 100;
RETURN error_rate <= threshold_percent;
END IF;
RETURN true;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT check_error_rate('sales_data_logged', 5.0); -- Check if error rate is under 5%
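One way to wire the check into a load script is a DO block that aborts when the threshold is exceeded (a sketch; the 5 percent threshold is arbitrary):
DO $$
BEGIN
IF NOT check_error_rate('sales_data_logged', 5.0) THEN
RAISE EXCEPTION 'Error rate for sales_data_logged exceeds 5 percent';
END IF;
END;
$$;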
Best practices
Design for errors
- Begin with permissive table definitions using text data types and generous error limits to understand the full scope of data quality issues before implementing strict validation rules.
- Implement data validation incrementally by gradually adding constraints and type conversions as you gain confidence in data quality and identify patterns in the error logs.
- Establish regular monitoring and review processes for error logs to identify systematic data quality issues and trends that may indicate problems with source systems or data processing pipelines.
Error limit guidelines
| Data Quality | Suggested Reject Limit | Use Case |
|---|---|---|
| High quality | 1-10 rows | Production systems |
| Medium quality | 1-5% | Development/testing |
| Poor quality | 10-20% | Initial data exploration |
| Unknown quality | 50% | Data discovery phase |
Operational procedures
- Establish a regular schedule for cleaning up old error logs to prevent excessive storage consumption and maintain system performance (see the example after this list).
- Configure monitoring systems to alert when error rates exceed predefined thresholds, enabling quick response to data quality issues.
- Maintain communication channels with data providers to share error patterns and collaborate on improving source data quality.
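For the cleanup task in the first item, Greenplum-derived systems provide a gp_truncate_error_log() function; a minimal sketch, assuming it is available in your release:
-- Clear the error log for one external table.
SELECT gp_truncate_error_log('sales_data_logged');
-- With elevated privileges, '*' clears the logs for all tables in the current database.
SELECT gp_truncate_error_log('*');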
Performance considerations
- Error logging functionality introduces some computational overhead during data loading operations, so consider this impact when processing large datasets.
- Setting higher segment reject limits allows more rows to be processed before the operation fails, but this also means more resources are consumed analyzing problematic data.
- Error log tables can grow significantly in production environments, so implement monitoring to track storage usage and prevent disk space issues.
Troubleshooting
High error rates
If you are experiencing high error rates:
- Analyze the error log entries to identify patterns or systematic issues in your data, such as consistent formatting problems or missing values in specific columns.
- Work with data providers to verify the quality and consistency of source data, including checking for recent changes in data formats or processing.
- Carefully review your external table definitions to ensure that column data types, delimiters, and format specifications accurately match the actual data structure.
- Start troubleshooting with small data samples to isolate issues quickly before processing larger datasets.
Performance issues
If error handling is impacting performance:
- Fine-tune your segment reject limits to balance between fault tolerance and processing efficiency, avoiding unnecessarily high thresholds that waste resources.
- For complex data with known quality issues, consider implementing a two-phase loading process using staging tables with flexible data types.
- Break large data loads into smaller, manageable batches to reduce memory pressure and improve error isolation.
- Continuously monitor system resources including memory usage, disk I/O, and storage consumption during data loading operations.