
Load 2 billion rows with SQL Loader

I have worked on a few projects during my professional life that extract data, transform it (or not), and load it into an Oracle database, and SQL Loader has always been my tool of choice when it comes to loading data into an Oracle database; indeed:
  • It is simple
  • It is fast
  • It natively supports CSV input files (see the minimal sketch right after this list)
  • It comes by default with every Oracle client/Instant Client, so there is no need to install any extra stuff
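
To illustrate, here is a minimal sketch of such a load; the table, column, and file names (mytable, col1, col2, mydata.csv) are made up for the example:

-- mycontrol.ctl: a minimal control file loading a CSV file into mytable
LOAD DATA
INFILE 'mydata.csv'
APPEND
INTO TABLE mytable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(col1, col2)

$ sqlldr userid="user/password@service" control=mycontrol.ctl direct=true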

Having said that, I recently ran into kind of a challenge when the number of rows to load was 1.7 billion (1,700,000,000) -- sorry for the clickbait title with 2 billion, but round numbers talk better to people :)
$ wc -l 2bill_load.csv
1704431835 2bill_load.csv
$ du -sh 2bill_load.csv
665G    2bill_load.csv
$

When loading this file into the target ADB using SQL Loader, it took 9 hours and 33 minutes:
$ sqlldr userid="user/password@service" control=mycontrol.sql direct=true etc...
. . .
Total logical records read:      1704431835
Elapsed time was:     09:33:50.10
CPU time was:         02:42:12.76
. . .
$

When you do the maths, it is actually not that bad:
  • 1.7 billion rows in 9h33 (34,430 seconds)
  • ~178 million rows per hour
  • ~3 million rows per minute
  • ~49,500 rows per second

But the whole picture is still kind of slow, so I looked for a workaround and found that very cool SQL Loader option: parallel.
Note that this option does not mean "load this big file using X parallel processes"; it is for loading different input files in parallel, and parallel=true lets SQL Loader know that you are loading X files into the same table at the same time.

I actually didn't have 1 big file; I had around 200,000 files adding up to 1.7 billion rows, so this parallel option was exactly what I was looking for. I just had to write a simple parallel execution loop as below:
L_LOAD_PARALLEL=16                                                             # Parallel degree
           l_nb=0                                                              # A counter
     l_nb_files=$(find . -type f -size +0 | wc -l)                             # Number of non-empty files to load
for F in $(find . -type f -size +0); do
    ((l_nb++))
    # Each sqlldr runs in the background; parallel=true allows concurrent loads into the same table
    sqlldr userid="user/password@service" control=mycontrol.sql direct=true data="${F}" parallel=true etc... &
    if ! ((l_nb % L_LOAD_PARALLEL)); then      # A full batch of loads has been started
        echo "Max parallel degree of ${L_LOAD_PARALLEL} reached; waiting. . ."
        echo "INFO: Files loaded: ${l_nb}/${l_nb_files}"
        wait                                   # Wait for the current batch before starting the next one
    fi
done
wait                                           # Wait for the last (possibly partial) batch
And by just running this load with a parallel degree of 16, the whole load took only 3 hours (knowing that I also removed potential duplicate rows from each file during this process) -- pretty cool, right? Depending on your VM and the shared load on that VM (I didn't push that parallel degree too far, as the VM I used had other activity running on it which I didn't want to disrupt), you could use a higher parallel degree to speed things up even more.
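
For the duplicate removal, one simple way of doing it is a per-file pre-processing pass like this sketch (a sketch only; ${F} is the file variable from the loop above):

# Keep only the first occurrence of each line, preserving the original order
# (unlike sort -u, no re-ordering of the file is needed)
awk '!seen[$0]++' "${F}" > "${F}.dedup" && mv "${F}.dedup" "${F}"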

If you have 1 big file with that many rows, you will have to split it into smaller files first; below is an example with a 100-million-row file split into 50 MB files:
$ wc -l 100_mill_file.csv
100000000 100_mill_file.csv
$ du -sh 100_mill_file.csv
40G     100_mill_file.csv
$ time split -b 50000000 100_mill_file.csv --additional-suffix ".dat"   <== split in 50MB files
real    0m44.538s           <== 44 seconds
user    0m0.094s
sys     0m13.751s
$ ls -ltr
total 779134832
-rw-rw-r--. 1 opc opc  41974611834 Mar 18 10:22 100_mill_file.csv
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xaa.dat
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xab.dat
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xac.dat
-rw-rw-r--. 1 opc opc     50000000 Mar 18 10:24 xad.dat
. . .
$ du -sh xaa.dat
48M     xaa.dat
$ ls -ltr x*.dat | wc -l
840
$
Note that SQL Loader requires a file extension, hence the --additional-suffix ".dat". Also be careful with split -b: it cuts on exact byte boundaries, so a row can end up split across two files; GNU split's -C option keeps whole lines together, as in the sketch below.
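
Here is the same command using -C instead of -b (a minimal sketch on the same file; each output file holds at most 50 MB of complete rows):

$ split -C 50000000 100_mill_file.csv --additional-suffix ".dat"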

That's all for this one, enjoy the SQL Loader power!
