This tutorial briefly introduces 3 methods to load the example data FAA
into Cloudberry Database tables you have created in the previous tutorial Lesson 3: Create Tables. Before continuing, make sure you have completed the previous tutorial.
-
Method 1: Use the
INSERT
statement. This is the easiest way to load data. You can executeINSERT
directly in psql, run scripts that haveINSERT
statements, or run a client-side application with database connection. It is not recommended to useINSERT
to load a large amount of data, because the loading efficiency is low. -
Method 2: Use the SQL statement
COPY
to load data into database. TheCOPY
syntax allows you to define the format of the text file so that data can be parsed into rows and columns. This method is faster than theINSERT
statement. But, likeINSERT
statement,COPY
is not a parallel data loading process.The
COPY
statement requires that external files be accessible to the host where the master process is running. On a multi-node Cloudberry Database system, data files might reside on a file system that is not accessible from master node. In this case, you need to use the psql command\copy meta-command
that streams data to Cloudberry master node overpsql
connection. Some example scripts in this tutorial use the\copy meta-command
. -
Method 3: Use Cloudberry Database utilities to load external data into tables. When you are working with a large-scale data warehouse, you might often face the challenge of loading large amounts of data in a short time. The utilities,
gpfdist
andgpload
, are tailored for this purpose, enabling you to achieve rapid, parallel data transfers.During your data loading process, if any rows run into issues, they will be noted. You can set an error threshold that fits your needs. If the number of problematic rows exceeds this limit, Cloudberry Database will stop the loading process.
For optimal speed, combine the use of external tables with the parallel file server (
gpfdist
). This approach will help you maximize efficiency, making your data loading tasks smoother and more efficient.Figure 1. External Tables Using Parallel File Server (gpfdist)
Another utility
gpload
is a batch job. When using this utility, you should specify a YAML-formatted control file, describe source data locations, format, transformations required, participating hosts, database destinations and other particulars in the file.gpload
will parse the control file and usegpfdsit
to execute the task. This allows you to describe a complex task and execute it in a controlled and repeatable way.