Lesson 4: Data Loading
This tutorial briefly introduces 3 methods to load the example data FAA
into Apache Cloudberry tables you have created in the previous tutorial Lesson 3: Create Tables. Before continuing, make sure you have completed the previous tutorial.
-
Method 1: Use the
INSERT
statement. This is the easiest way to load data. You can executeINSERT
directly in psql, run scripts that haveINSERT
statements, or run a client-side application with database connection. It is not recommended to useINSERT
to load a large amount of data, because the loading efficiency is low. -
Method 2: Use the SQL statement
COPY
to load data into database. TheCOPY
syntax allows you to define the format of the text file so that data can be parsed into rows and columns. This method is faster than theINSERT
statement. But, likeINSERT
statement,COPY
is not a parallel data loading process.The
COPY
statement requires that external files be accessible to the host where the coordinator process is running. On a multi-node Apache Cloudberry system, data files might reside on a file system that is not accessible from coordinator node. In this case, you need to use the psql command\copy meta-command
that streams data to Cloudberry coordinator node overpsql
connection. Some example scripts in this tutorial use the\copy meta-command
. -
Method 3: Use Apache Cloudberry utilities to load external data into tables. When you are working with a large-scale data warehouse, you might often face the challenge of loading large amounts of data in a short time. The utilities,
gpfdist
andgpload
, are tailored for this purpose, enabling you to achieve rapid, parallel data transfers.During your data loading process, if any rows run into issues, they will be noted. You can set an error threshold that fits your needs. If the number of problematic rows exceeds this limit, Apache Cloudberry will stop the loading process.
For optimal speed, combine the use of external tables with the parallel file server (
gpfdist
). This approach will help you maximize efficiency, making your data loading tasks smoother and more efficient.Figure 1. External Tables Using Parallel File Server (gpfdist)
Another utility
gpload
is a batch job. When using this utility, you should specify a YAML-formatted control file, describe source data locations, format, transformations required, participating hosts, database destinations and other particulars in the file.gpload
will parse the control file and usegpfdist
to execute the task. This allows you to describe a complex task and execute it in a controlled and repeatable way.
Quick-start operations
In the following exercise, you will load sample data into the tutorial
database using each of these above methods.
Load data using INSERT
In Lesson 3: Create Tables, you have created 6 tables in the tutorial
database, one of which is d_cancellation_codes
in the faa
directory.
The faa.d_cancellation_codes
table is a simple 2-column look-up table. You will load data into the table using the INSERT
statement.
-
Log into Apache Cloudberry in Docker as
gpadmin
, and change to thefaa
directory. This directory containsfaa
data and scripts.[gpadmin@cdw ~]$ cd /tmp/faa
-
Log into the
tutorial
database aslily
.[gpadmin@cdw faa]$ psql -U lily -d tutorial
Password for user lily: # changeme
psql (14.4, server 14.4)
Type "help" for help.
tutorial=> -
Check the
faa.d_cancellation_codes
table.tutorial=> \d d_cancellation_codes
Output:
Table "faa.d_cancellation_codes"
Column | Type | Collation | Nullable | Default
-------------+------+-----------+----------+---------
cancel_code | text | | |
cancel_desc | text | | |
Distributed by: (cancel_code) -
Insert data into the
faa.d_cancellation_codes
table.tutorial=> INSERT INTO faa.d_cancellation_codes
tutorial-> VALUES ('A', 'Carrier'),
tutorial-> ('B', 'Weather'),
tutorial-> ('C', 'NAS'),
tutorial-> ('D', 'Security'),
tutorial-> ('', 'none');Output:
INSERT 0 5
tutorial=>
Load data using COPY
The COPY
statement moves data from the file system to database tables. Data for 5 of the faa
tables is in the following CSV-formatted text files:
-
In a text editor, open and check the following
.csv
data files.L_AIRLINE_ID.csv
L_AIRPORTS.csv
L_DISTANCE_GROUP_250.csv
L_ONTIME_DELAY_GROUPS.csv
L_WORLD_AREA_CODES.csv
Note that the first line of each file contains the column names, and that the last line of each file contains the characters
.
, which signals the end of the input data. -
In a text editor, open and check the following SQL scripts:
copy_into_airlines.sql
copy_into_airports.sql
copy_into_delay_groups.sql
copy_into_distance_groups.sql
copy_into_wac.sql
-
Log into the
tutorial
database aslily
.[gpadmin@cdw faa]$ psql -U lily -d tutorial
Password for user lily: # changeme
psql (14.4, server 14.4)
Type "help" for help.
tutorial=>