
Load 2 billion rows with SQL Loader

I have worked on a few projects during my professional life that extract data, transform it (or not), and load it into an Oracle database, and SQL Loader has always been my tool of choice when it comes to loading data into an Oracle database; indeed:
  • It is simple
  • It is fast
  • It natively supports CSV input files (see the minimal sketch right after this list)
  • It comes by default with every Oracle client/Instant Client, so there is no need to install any extra stuff
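
To illustrate, here is a minimal sketch of such a load; the table, column, and file names (mytable, col1, col2, mydata.csv) are made up for the example:

-- mycontrol.ctl: a minimal control file loading a CSV file into mytable
LOAD DATA
INFILE 'mydata.csv'
APPEND
INTO TABLE mytable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(col1, col2)

$ sqlldr userid="user/password@service" control=mycontrol.ctl direct=true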

Having said that, I recently ran into kind of a challenge when the number of rows to load was 1.7 billion (1,700,000,000) -- sorry for the clickbait title with 2 billion, but round numbers talk better to people :)
$ wc -l 2bill_load.csv
1704431835 2bill_load.csv
$ du -sh 2bill_load.csv
665G    2bill_load.csv
$

When loading this file into the target ADB using SQL Loader, it took 9 hours and 33 minutes:
$ sqlldr userid="user/password@service" control=mycontrol.sql direct=true etc...
. . .
Total logical records read:      1704431835
Elapsed time was:     09:33:50.10
CPU time was:         02:42:12.76
. . .
$

When you do the maths, it is actually not that bad:
  • 1.7 billion rows in 9h33 (34,430 seconds)
  • ~178 million rows per hour
  • ~3 million rows per minute
  • ~49,500 rows per second

But the whole picture is still kind of slow, so I looked for a workaround and found that very cool SQL Loader option: parallel.
Note that this option does not mean "load this big file using X parallel processes"; it is for loading different input files in parallel, and parallel=true lets SQL Loader know that you are loading X files into the same table at the same time.

I actually didn't have 1 big file; I had around 200,000 files adding up to 1.7 billion rows, so this parallel option was exactly what I was looking for. I just had to write a simple parallel execution loop as below:
L_LOAD_PARALLEL=16                                                             # Parallel degree
           l_nb=0                                                              # A counter
     l_nb_files=$(find . -type f -size +0 | wc -l)                             # Number of non-empty files to load
for F in $(find . -type f -size +0); do
    ((l_nb++))
    # Each sqlldr runs in the background; parallel=true allows concurrent loads into the same table
    sqlldr userid="user/password@service" control=mycontrol.sql direct=true data="${F}" parallel=true etc... &
    if ! ((l_nb % L_LOAD_PARALLEL)); then      # A full batch of loads has been started
        echo "Max parallel degree of ${L_LOAD_PARALLEL} reached; waiting. . ."
        echo "INFO: Files loaded: ${l_nb}/${l_nb_files}"
        wait                                   # Wait for the current batch before starting the next one
    fi
done
wait                                           # Wait for the last (possibly partial) batch
And by just running this load with a parallel degree of 16, the whole load took only 3 hours (knowing that I also removed potential duplicate rows from each file during this process) -- pretty cool, right? Depending on your VM and the shared load on that VM (I didn't push that parallel degree too far, as the VM I used had other activity running on it which I didn't want to disrupt), you could use a higher parallel degree to speed things up even more.
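
For the duplicate removal, one simple way of doing it is a per-file pre-processing pass like this sketch (a sketch only; ${F} is the file variable from the loop above):

# Keep only the first occurrence of each line, preserving the original order
# (unlike sort -u, no re-ordering of the file is needed)
awk '!seen[$0]++' "${F}" > "${F}.dedup" && mv "${F}.dedup" "${F}"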

If you have 1 big file with that many rows, you will have to split it into smaller files first; below is an example with a 100-million-row file split into 50 MB files:
$ wc -l 100_mill_file.csv
100000000 100_mill_file.csv
$ du -sh 100_mill_file.csv
40G     100_mill_file.csv
$ time split -b 50000000 100_mill_file.csv --additional-suffix ".dat"   <== split in 50MB files
real    0m44.538s           <== 44 seconds
user    0m0.094s
sys     0m13.751s
$ ls -ltr
total 779134832
-rw-rw-r--. 1 opc opc  41974611834 Mar 18 10:22 100_mill_file.csv
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xaa.dat
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xab.dat
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xac.dat
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xad.dat
. . .
$ du -sh xaa.dat
48M     xaa.dat
$ ls -ltr x*.dat | wc -l
840
$
Note that SQL Loader requires a file extension, hence the --additional-suffix ".dat". Also be careful with split -b: it cuts on exact byte boundaries, so a row can end up split across two files; GNU split's -C option keeps whole lines together, as in the sketch below.
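
Here is the same command using -C instead of -b (a minimal sketch on the same file; each output file holds at most 50 MB of complete rows):

$ split -C 50000000 100_mill_file.csv --additional-suffix ".dat"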

That's all for this one, enjoy the SQL Loader power!
