
How I beat the most modern Cloud with a command developed in . . . 1976

42. For decades, we thought that 42 was the answer to the Ultimate Question of Life, the Universe, and Everything, but it seems that a subtle sed 's/42/Move to the Cloud !!/' has happened and we now have another answer to pretty much everything. And this is kind of cool, as plenty of new challenging projects come to us with new tools and modern technologies.

Introduction

Then came another project of a client moving to a top Cloud vendor, where the need was to run DAGs with the Apache Airflow workflow manager. In simple words, a DAG is a graphical representation of a job's steps to execute, with dependencies and parallelism, as the very simple DAG below shows:
Let's describe this one:
      1/ Job starts
      2/ Step 1 runs
      3/ Step 2 and Step 3 run in parallel after Step 1 finishes
      4/ Step 4 has to wait for Step 2 and Step 3 to finish before being run
      5/ Job ends
This is indeed a very simple one; in my case the jobs had hundreds of steps and dependencies to run, so big and complex that no monitor on the market could fully show one on a single screen (you may find some more complex examples here).

Airflow being too slow at running these complex DAGs (from what I read here and there, it seems that complex dependencies and a large number of tasks are a known Airflow limitation), the adventure started for me with the below requirements:
  • A JSON file contains some jobs with many steps with dependencies
  • Execute them as fast as possible in parallel when possible

Challenge accepted !

I then started to have a look at the JSON file, becoming familiar with the dependency structures, etc., and started to think about how to code that "manage dependencies and parallelism" feature.
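To give an idea of what I mean (this is not the client's actual file, just a minimal hypothetical shape for illustration), the kind of structure to deal with looks like this:
{
  "job": "my_job",
  "steps": [
    { "name": "step1", "depends_on": [],                 "script": "step1.sh" },
    { "name": "step2", "depends_on": ["step1"],          "script": "step2.sh" },
    { "name": "step3", "depends_on": ["step1"],          "script": "step3.sh" },
    { "name": "step4", "depends_on": ["step2", "step3"], "script": "step4.sh" }
  ]
}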

make

make is a command Stuart Feldman started to develop in April 1976, inspired by the experience of a coworker who wasted time futilely debugging a program whose executable had accidentally not been updated with his changes (more on the history here). He then created make which, based on makefiles, is able to execute arbitrary code respecting dependencies and using parallelism to speed up the process. This looked to be exactly what I was looking for. Also, make ships with any Unix system and is very light (compared to other tools which need GBs of software).

A first makefile

make executes makefiles, which are text files containing arbitrary code to execute and dependencies; let's have a look at a simple step syntax:
step_name: step_1 step_2
         some_code_to_execute.sh
Where:
  • step_name is the name of a step
  • step_1 step_2 are the steps which step_name depends on; in this example, step_name can only be executed after step_1 and step_2 have been executed (and were successful, if you wish)
  • some_code_to_execute.sh: the code to execute
Easy, right ? (Note that in a real makefile, recipe lines like some_code_to_execute.sh must start with a tab character.) Now let's have a look at what the makefile for the above DAG screenshot could be:
done: the_end
step1:
        step1.sh
step2: step1
        step2.sh
step3: step1
        step3.sh
step4: step2 step3
        step4.sh
the_end: step4
This short makefile represents the DAG shown earlier; a new done target appears here. done is not a make keyword: it is simply the first target of the file, so it becomes make's default goal, and I make it depend on the label marking the end of the makefile, which I named the_end and which depends on step4.
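If you want to check the order make resolves from these dependencies without actually executing anything, the -n (dry run) option just prints the commands instead of running them; assuming the above is saved as mymakefile (the name used in the next sections), you should see something like:
$ make -f mymakefile -n
step1.sh
step2.sh
step3.sh
step4.sh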

Parallelism

-j tells make to start the steps with as much parallelism as possible (obviously respecting the dependencies); with no number, it puts no limit on the number of simultaneous jobs (in practice you are bounded by the dependency tree and the machine's CPUs), and with -j N it runs at most N steps at a time. In my example, step2 and step3 can be executed in parallel, but not the other steps, as enforced by the dependency tree. And this has been managed automatically by make for more than 40 years, awesome, right ?

If you would like to modify the degree of parallelism (or why not run everything serially), feel free to experiment with different values of the -j option.
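For example (illustrative invocations on the makefile above):
$ make -f mymakefile           # no -j: steps run one at a time
$ make -f mymakefile -j2       # at most 2 steps in parallel
$ make -f mymakefile -j        # no limit, as parallel as the dependencies allow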

A first execution

Let's execute this first makefile; as the step*.sh scripts don't exist and the purpose here is just to show how make works, I will simply echo their names instead of running them, print a date when each step starts and finishes, and sleep a few seconds during each step. The makefile now looks like this:
$ cat mymakefile
done: the_end
step1:
        @echo `date`": Executing step1.sh"
        sleep 1
        @echo `date`": End step 1"
step2: step1
        @echo `date`": Executing step2.sh"
        sleep 4
        @echo `date`": End step 2"
step3: step1
        @echo `date`": Executing step3.sh"
        sleep 10
        @echo `date`": End step 3"
step4: step2 step3
        @echo `date`": Executing step4.sh"
        sleep 2
        @echo `date`": End step 4"
the_end: step4
$
Now, execute it:
$ make -f mymakefile -j
Wed Nov 20 04:13:26 UTC 2019: Executing step1.sh
sleep 1
Wed Nov 20 04:13:27 UTC 2019: End step 1
Wed Nov 20 04:13:27 UTC 2019: Executing step3.sh
sleep 10
Wed Nov 20 04:13:27 UTC 2019: Executing step2.sh
sleep 4
Wed Nov 20 04:13:31 UTC 2019: End step 2
Wed Nov 20 04:13:37 UTC 2019: End step 3
Wed Nov 20 04:13:37 UTC 2019: Executing step4.sh
sleep 2
Wed Nov 20 04:13:39 UTC 2019: End step 4
$
With this execution, you can validate the parallelism (thanks to the -j option described above) between step 2 and step 3, and the dependencies (step 2 and step 3 start after step 1 is done, and step 4 waits for step 3 to finish even though step 2 is faster).

Error codes

Like any Unix tool, make reacts to the return codes of the code it executes:
  • If the return code of a step*.sh script is 0, make continues with the next steps
  • If the return code of a step*.sh script is non-zero, make stops there and does not execute the dependent steps
  • This last behaviour can be modified with the -k (keep going) option
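As an illustration, here is a hypothetical makefile (not one from this project) where step2 fails; running it serially shows the difference the -k option makes:
$ cat failmakefile
done: the_end
step1:
        @echo "step1 OK"
step2: step1
        @echo "step2 fails"; exit 1
step3: step1
        @echo "step3 OK"
step4: step2 step3
        @echo "step4"
the_end: step4
$ make -f failmakefile         # stops at the failed step2: step3 and step4 are not run
$ make -f failmakefile -k      # keeps going: step3 still runs, step4 is skipped as it depends on the failed step2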

A first implementation

This small first test confirmed that make was the tool I needed: easy, light, standard, robust and well documented.
I then just had to read the JSON configuration file containing all the jobs, steps and dependencies, generate a makefile with these steps and dependencies for a specific job and ... execute it !
Here is my first code after a few hours working on this (it is rough because it was a draft :)). Note that I added a random sleep after each step to test and make sure that the parallelism and dependencies were working as expected; and they were.
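That script itself is not reproduced here, but to give an idea of the approach, below is a minimal sketch of the generation step, assuming the hypothetical JSON shape shown earlier and the jq tool to parse it (an assumption for brevity; my actual script used awk and is far more complete):
$ cat gen_makefile.sh
#!/bin/sh
# Minimal sketch: read a job's steps and dependencies from a JSON file
# and emit a makefile where each step runs its script.
JOB_JSON=${1:-job.json}
OUT=${2:-generated.mk}
{
  # the default goal, depending on the final label
  echo "done: the_end"
  # one target per step: "name: dependencies", then a tab and the script to run
  jq -r '.steps[] | .name + ": " + (.depends_on | join(" ")) + "\n\t" + .script' "$JOB_JSON"
  # the_end waits for every step
  echo "the_end: $(jq -r '[.steps[].name] | join(" ")' "$JOB_JSON")"
} > "$OUT"
# ... and execute it with full parallelism
make -f "$OUT" -j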

The final implementation

The previous ~100-line script was a first implementation, a proof of concept; the final implementation is more complete (but still less than 1000 lines of code). It generates makefiles of ~2000 lines with steps calling a shell script to execute SQLs against bigquery, with the below features:
  • There is not one JSON file but two: one with the main step dependencies and one with the scripts that each step contains; each step has many scripts to execute and these also have dependencies amongst themselves

  • Each script has to be executed against bigquery; they are usually hundreds of lines of generated SQL with parameters. The values of these parameters are in a YAML file, and the script dynamically replaces the parameters with the correct values before each execution against bigquery (see the sketch after this list)

  • Some date handling is also done to execute the SQLs by time interval

  • The time interval can be customised so you can divide the work to be done, for example executing the SQLs from Jan 1st 2019 to the current date by month, by week, etc.; obviously, start date, end date and interval can be anything

  • I have implemented a "disruptor" mechanism where you can stop an execution from outside the script, bigquery, Airflow, etc ...

  • I have implemented a rerun mechanism to be able to rerun a failed job execution with the exact same parameters skipping the steps or the whole DAGs which were previously successfully executed

  • The logs are shown on the screen, written to logfiles and also inserted row by row into a MySQL database (it is too slow to do this in bigquery and you would hit the table insertion quota very quickly)

  • The logs being inserted row by row as each step and/or substep is executed, the execution can be followed live: we have designed a nice dashboard with statistics on each step, and the DAGs are shown graphically with green for an executed step, grey for a step being executed, red for a failed step, etc., so the client can follow the execution of any complex DAG live in a nice, graphical way

  • I had to implement a wait mechanism because bigquery is eventually consistent, so I have to wait 1 second (this is a parameter and can be changed) after some jobs (thus slowing down the very tool I wrote to execute jobs as fast as possible; the world is crazy, right ? :)) to let the eventual consistency happen, so that the next step finds the correct data to work with. The logs being very detailed, we can calculate how long we slept to accommodate this eventual consistency and therefore how much it "costs" per job, per DAG, etc.

  • As Airflow was the tool initially intended to do this job and is a scheduler, it is used to start the script and then shows a green light if it is successful and a red one if there is a failure. You can also follow the logs of the script from Airflow; this is a technical output for technicians, though, and the client will prefer the nice GUI showing graphical DAGs colouring themselves as steps execute and complete

  • Everything previously cited is aaP ("as a Parameter", to mimic "as a Service" :)) and can then be modified very easily in Airflow before any job execution if needed; everything also has a default value, and only the name of the job to execute is a mandatory parameter

  • We can exclude some DAGs from an execution

  • We can run prescripts and postscripts if needed

  • . . . and more . . .
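As promised above, here is a minimal sketch of the parameter-substitution idea; the file names, the naive YAML handling and the use of sed are assumptions for illustration only, not the project's actual code:
$ cat run_sql.sh
#!/bin/sh
# Hypothetical helper: replace ${key} placeholders in a SQL file with values
# taken from a "key: value" per line YAML file, then run it against bigquery.
SQL_FILE=$1
PARAM_FILE=$2
SQL=$(cat "$SQL_FILE")
while IFS=': ' read -r key value; do
  SQL=$(printf '%s' "$SQL" | sed "s/\${$key}/$value/g")
done < "$PARAM_FILE"
printf '%s' "$SQL" | bq query --use_legacy_sql=false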

It is also worth mentioning that many of the features implemented were not present in the first, too slow, Airflow design, and my tool runs the DAGs at least 10 times faster than Airflow did.


Thanks for reading, I hope you enjoyed this blog as much as I enjoyed achieving this challenge with these old buddies make and awk, humbly continuing the Unix ethos: printable, debuggable, understandable stuff !!

