I have worked on a few Extract data / transform data (or not) / load data into an Oracle database projects during my professional life, and SQL Loader has always been my tool of choice when it comes to loading data into an Oracle database; indeed:
- It is simple
- It is fast
- It natively supports CSV input files
- It comes by default with every Oracle client / Instant Client, so there is nothing extra to install
Having said that, I recently ran into kind of a challenge when the number of rows to load was 1.7 billion (1,700,000,000) -- sorry for the clickbait title with 2 billion, but round numbers talk better to people :)
$ wc -l 2bill_load.csv
1704431835 2bill_load.csv
$ du -sh 2bill_load.csv
665G 2bill_load.csv
$
Before jumping in, please note that all of this was achieved using the following configuration:
- SQL Loader 19
- A VM.Standard.E4.Flex OCI VM with 4 OCPU and 64 GB RAM
- A 1 TB paravirtualized OCI block volume (performance: Balanced) to store the input file(s)
- An Oracle Autonomous Database (workload type "Data Warehouse") as the target database
When loading this file into the target ADB using SQL Loader, it took 9 hours and 33 minutes:
$ sqlldr userid="user/password@service" control=mycontrol.sql direct=true etc...
. . .
Total logical records read:    1704431835

Elapsed time was:    09:33:50.10
CPU time was:        02:42:12.76
. . .
$
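The control file itself is not shown in this post; a minimal sketch of what such a control file could look like for a simple CSV load is below (the table and column names are just placeholders to be replaced with your own):

# Hypothetical minimal control file for a comma-separated input file
cat > mycontrol.sql <<'EOF'
LOAD DATA
APPEND
INTO TABLE my_target_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  col1,
  col2,
  col3
)
EOF

The data file itself is then passed on the command line with the data= parameter, as in the commands shown in this post.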
When you do the maths, it is actually not that bad (a quick way to re-derive these figures follows the list):
- 1.7 billion rows in 9h33
- around 178 million rows per hour
- around 2.97 million rows per minute
- around 49,500 rows per second
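If you want to double check these numbers, here is a quick sketch of the calculation, using the row count and elapsed time reported in the SQL Loader log above:

# 1,704,431,835 rows loaded in 09:33:50 (34,430 seconds)
awk 'BEGIN {
  rows = 1704431835
  secs = 9*3600 + 33*60 + 50                             # 34430 seconds
  printf "rows per second: %d\n", rows / secs            # ~49,500
  printf "rows per minute: %d\n", rows / secs * 60       # ~2.97 million
  printf "rows per hour  : %d\n", rows / secs * 3600     # ~178 million
}'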
But the whole picture is still kind of slow, so I looked for a workaround and found this very cool SQL Loader option: parallel=true.
Note that this option does not mean "load this big file using X parallel processes"; it is meant for loading different input files in parallel, and parallel=true lets SQL Loader know that you are loading X files into the same table at the same time.
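In other words, the idea is to start several SQL Loader sessions at the same time, each one pointed at its own data file; conceptually it looks like this (part_001.csv and part_002.csv are just placeholder names):

# Two concurrent direct path loads into the same table; parallel=true tells
# each session that other loads are hitting that table at the same time
sqlldr userid="user/password@service" control=mycontrol.sql data=part_001.csv direct=true parallel=true &
sqlldr userid="user/password@service" control=mycontrol.sql data=part_002.csv direct=true parallel=true &
wait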
I actually didn't have one big file; I had around 200,000 files adding up to 1.7 billion rows, so this parallel option was exactly what I was looking for. I just had to write a simple parallel execution like the one below:
L_LOAD_PARALLEL=16                                   # Parallel degree
l_nb=0                                               # A counter
l_nb_files=$(find . -type f -size +0 | wc -l)        # Number of files to load

for F in $(find . -type f -size +0); do
    ((l_nb++))
    sqlldr userid="user/password@service" control=mycontrol.sql direct=true data="${F}" parallel=true etc... &
    if ! ((l_nb%L_LOAD_PARALLEL)); then
        echo "Max parallel degree of ${L_LOAD_PARALLEL} reached; waiting. . ."
        echo "INFO" "Files loaded: ${l_nb}/${l_nb_files}"
        wait
    fi
done
wait

And by just running this load with a parallel degree of 16, the whole load took only 3 hours (knowing that I also removed potential duplicate rows in each file during this process) -- pretty cool, right? Depending on your VM and the shared load on that VM (I didn't push the parallel degree too far, as the VM I used has other activity running on it that I didn't want to disrupt), you could use a higher parallel degree to speed that up even more.
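As a side note, if you prefer not to manage the counter and the wait calls yourself, the same fan-out can be sketched with GNU xargs -P; it keeps a rolling pool of 16 sqlldr sessions instead of waiting for each batch of 16 to finish. This is just an alternative sketch, not what I used for the run above:

# Keep at most 16 concurrent SQL Loader sessions, one per input file
find . -type f -size +0 -print0 |
  xargs -0 -P 16 -I {} \
    sqlldr userid="user/password@service" control=mycontrol.sql direct=true parallel=true data="{}"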
If you have one big file with that many rows, you will have to split it into smaller files first; below is an example with a 100-million-row file split into 50 MB files:
$ wc -l 100_mill_file.csv
100000000 100_mill_file.csv
$ du -sh 100_mill_file.csv
40G 100_mill_file.csv
$ time split -b 50000000 100_mill_file.csv --additional-suffix ".dat"    <== split in 50MB files

real    0m44.538s    <== 44 seconds
user    0m0.094s
sys     0m13.751s
$ ls -ltr
total 779134832
-rw-rw-r--. 1 opc opc 41974611834 Mar 18 10:22 100_mill_file.csv
-rw-rw-r--. 1 opc opc    50000000 Mar 18 10:24 xaa.dat
-rw-rw-r--. 1 opc opc    50000000 Mar 18 10:24 xab.dat
-rw-rw-r--. 1 opc opc    50000000 Mar 18 10:24 xac.dat
-rw-rw-r--. 1 opc opc    50000000 Mar 18 10:24 xad.dat
. . .
$ du -sh xaa.dat
48M xaa.dat
$ ls -ltr x*.dat | wc -l
840
$

Note that SQL Loader requires a file extension, hence the --additional-suffix ".dat".
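One caveat to keep in mind: split -b cuts on byte boundaries, so the last CSV row of a chunk can end up cut in half across two files. If you want to keep rows whole, GNU split can do that with -C (line-bytes) instead of -b; a minimal sketch:

$ split -C 50M --additional-suffix ".dat" 100_mill_file.csv    # at most ~50 MB per file, whole lines only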
That's all for this one, enjoy the SQL Loader power!