An Unknown DBA blog: Some bash tips -- 12 -- Variables Manipulation

This blog is part of a bash tips list I find useful to use on every script -- the whole list can be found here.

Now that we know how and why to protect and quote our variables in bash, it is time to explore some more advanced variable manipulation techniques: the powerful things that can happen between bash's curly brackets (aka braces), {}. And this is not just cool stuff to make us look geeky; it is also a great way to make our code much faster and more scalable.

Variable length

The first thing we can do is get the length of a variable's value by putting # in front of the variable name inside the curly brackets:

$ var="12345678"
$ echo "${#var}"
8
$ if [[ "${#var}" -gt 7 ]]; then echo "This is a very long variable !"; fi
This is a very long variable !
$
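
The same length test can be wrapped in a small reusable check. This is only a sketch: the function name length_between and its min/max arguments are made up for illustration. Note that ${#value} counts characters according to the current locale, not bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the value is between
# min and max characters long (both bounds included).
length_between() {
  local value=$1 min=$2 max=$3
  (( ${#value} >= min && ${#value} <= max ))
}

if length_between "12345678" 1 8; then
  echo "length OK"
fi
```

Because the function only returns a status, it can be used directly in an if or with && and ||.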

Variables Trimming

We can also very easily extract part of a variable with ${var:position:length}; for example, to show only the first character of a variable:

$ echo "${var}"
12345678
$ echo "${var:0:1}"
1
$ echo "${var::1}"  <== the "0" can be omitted
1
$

Using the same principle, a negative length counts back from the end of the variable (this needs bash 4.2 or later):

$ echo "${var::-1}"
1234567    <== remove the last character
$ echo "${var:2:-2}"
3456       <== remove the first 2 characters and the last 2
$

Now let's say you are interested in the first 2 and the last 2 characters and not in anything in between:

$ size="${#var}"
$ echo "${var:0:2}${var:$size-2:2}"
1278
$
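
A negative position is another way to reach the end of the string without computing its size first. The sketch below uses a made-up helper name, head_tail; the only subtle point is the space before the minus sign.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first n and the last n characters of a string.
head_tail() {
  local value=$1 n=$2
  # The space before the minus sign is required: "${value:-$n}" would be
  # parsed as the "use a default value" expansion instead.
  printf '%s%s\n' "${value:0:n}" "${value: -n}"
}

head_tail "12345678" 2   # prints 1278
```

Position and length are evaluated as arithmetic expressions, which is why n can be written without a $ inside the braces.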

Variables Replacement

The syntax ${var/pattern/replacement} can be used to easily replace a pattern in a string; and a double slash would be used for a global substitution ${var//pattern/replacement}:

$ var="ABcd----ABcd----ABcd"
$ echo "${var/B/Z}"
AZcd----ABcd----ABcd  <== simple substitution here, only the first B has been replaced by Z
$ echo "${var//B/Z}"  <== note the double slash // here
AZcd----AZcd----AZcd  <== global substitution here, all the B have been replaced by Z
$

If we want to replace only what starts or ends a variable, we would use ${var/#pattern/replacement} and ${var/%pattern/replacement}; let's see a couple of examples:

$ echo "${var/#A/Z}"
ZBcd----ABcd----ABcd  <== Starting A has been replaced
$ echo "${var/%d/Z}"
ABcd----ABcd----ABcZ  <== Ending d has been replaced
$
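
A typical use of the anchored form is swapping a file suffix. This is a sketch under assumptions: swap_suffix is a hypothetical name, and the old suffix is quoted inside the expansion so that it is matched literally rather than as a pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace a trailing suffix, and only a trailing one.
swap_suffix() {
  local name=$1 old=$2 new=$3
  # /% anchors the match at the end; "$old" is quoted so it is taken literally.
  printf '%s\n' "${name/%"$old"/$new}"
}

swap_suffix "report.txt" ".txt" ".log"   # prints report.log
```

If the name does not end with the old suffix, the value is printed unchanged.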

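There are no regular expressions here: the patterns are the same globs bash uses for filenames, optionally extended by extglob, which stands for extended globbing. The syntax differs from regexps but it is easy to pick up (I found this blog clear on the subject). Note that the +(0) patterns below only work when extglob is enabled with shopt -s extglob; it is off by default, so a script must enable it explicitly: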

$ echo "${var//[Ad]/Z}"
ZBcZ----ZBcZ----ZBcZ
$ echo "${var//B?/Z}"
AZd----AZd----AZd
$ var="00000000012340000000"
$ echo "${var/#+(0)/}"
12340000000      <== Remove the leading zeros
$ echo "${var/%+(0)/}"
0000000001234    <== Remove the trailing zeros
$
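
In a script, the extglob patterns above need shopt -s extglob before they are used. A minimal sketch, with hypothetical helper names strip_leading_zeros and trim_spaces:

```shell
#!/usr/bin/env bash
shopt -s extglob   # required for +( ... ) patterns; off by default in scripts

# Hypothetical helper: remove every leading zero.
strip_leading_zeros() {
  local value=$1
  printf '%s\n' "${value/#+(0)/}"
}

# Hypothetical helper: remove leading and trailing whitespace.
trim_spaces() {
  local value=$1
  value=${value/#+([[:space:]])/}   # leading whitespace
  value=${value/%+([[:space:]])/}   # trailing whitespace
  printf '%s\n' "$value"
}

strip_leading_zeros "0001234000"   # prints 1234000
trim_spaces "   some text   "      # prints some text
```

Be aware that a value made only of zeros becomes an empty string with this approach, so a caller doing arithmetic may want to default it back to 0.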

Uppercase and Lowercase

Another thing I use a lot, especially when getting parameters from the command line through getopt or getopts, is converting to upper and lowercase; the simple operators , ,, ^ and ^^ are used for this purpose (they need bash 4.0 or later):

$ VAR="ABCD"
$ echo "${VAR,}"
aBCD  <== first character in lowercase
$ echo "${VAR,,}"
abcd  <== all the characters in lowercase
$ var="abcd"
$ echo "${var^}"
Abcd  <== first character in uppercase
$ echo "${var^^}"
ABCD  <== all the characters in uppercase
$
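
For command-line parameters, lowercasing first keeps the following case statement short. This is a sketch only: normalize_answer and the list of accepted spellings are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a user-supplied answer to "yes" or "no".
normalize_answer() {
  local answer=${1,,}          # lowercase the whole value (bash 4.0+)
  case "$answer" in
    y|yes|true)  echo "yes" ;;
    n|no|false)  echo "no"  ;;
    *)           return 1   ;; # unknown answer: let the caller decide
  esac
}

normalize_answer "YES"   # prints yes
```

Without the lowercase step, the case statement would need a pattern for every mix of upper and lowercase letters.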

Performance and scalability

I know what you are thinking at this point: "OK Fred, this is cool, but this is a lot of new syntax to learn and I can already do all of that with sed, awk or tr as I have always done!"
As an example is worth 10k words, let's use a simple lowercase example to see whether learning all of this syntax is worth it (note that there may be other ways to lowercase a variable; these are the ones I see most often):

$ A="ABCD"
$ time echo "${A,,}"
abcd
real    0m0.001s   <==
user    0m0.000s
sys     0m0.000s
$ time echo "${A}" | tr '[:upper:]' '[:lower:]'
abcd
real    0m0.017s  <==
user    0m0.008s
sys     0m0.005s
$ time echo "${A}" | awk '{print tolower($1)}'
abcd
real    0m0.075s  <==
user    0m0.008s
sys     0m0.042s
$

These are still small times, so no big deal? Well, let's do that 10k times:

$ time for i in $(seq 1 10000); do echo "${A,,}" > /dev/null; done
real    0m0.352s  <==
user    0m0.285s
sys     0m0.066s
$ time for i in $(seq 1 10000); do echo "${A}" | tr '[:upper:]' '[:lower:]' > /dev/null; done
real    0m46.088s  <==
user    0m39.311s
sys     0m18.098s
$ time for i in $(seq 1 10000); do echo "${A}" | awk '{print tolower($1)}' > /dev/null; done
real    1m8.847s  <==
user    0m50.360s
sys     0m29.844s
$

This now seems to be very much a big deal, right? From a third of a second with the ${A,,} syntax to more than 1 minute with awk! And look at the system footprint: the difference is huge!
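
Most of that gap comes from process creation: every pass through the tr or awk loop sets up a pipeline and starts an external program, while the ${A,,} expansion stays inside the shell. The sketch below is one way to reproduce the comparison on your own machine; the helper names and the iteration count of 200 are arbitrary choices for illustration, and the timings will differ from the ones above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: two ways to lowercase one value.
lower_builtin() { printf '%s\n' "${1,,}"; }                             # no external process
lower_tr()      { printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'; }    # pipeline + tr per call

# Call the given function n times on the same input, discarding the output.
bench() {
  local fn=$1 n=$2 i
  for (( i = 0; i < n; i++ )); do
    "$fn" "ABCD" > /dev/null
  done
}

time bench lower_builtin 200
time bench lower_tr 200
```

Both helpers print the same result; only the cost per call differs.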

If a 10k loop to lowercase a variable does not look very realistic to you, imagine a script using 10 variables, trimming pieces of variables, removing leading zeros or leading spaces, putting some variables in lowercase and some in uppercase, and that script being executed on 1000 servers... this looks more realistic, right?

Then the global system footprint, the worse performance and the poor scalability of tr or awk compared with these simple bash variable manipulation features would be a real performance killer. And if you think about it, more system resources = more electricity used, which would also make you partly responsible for Global Warming! :)

So jump on your keyboards and modify your scripts to use these bash variable manipulation features to reduce Global Warming!


5 comments:

Anonymous, February 11, 2022 at 4:48 AM
apples, oranges and other fruits

The times you see are because you are doing exec() 10k times. While your argument is valid for a "script using 10 variables", for mass editing (i.e. a "10k loop to lowercase a variable") it's better to do it idiomatically:

$ A="ABCD"

$ time for i in $(seq 1 10000); do echo "${A,,}" > /dev/null; done

real 0m0.142s
user 0m0.110s
sys 0m0.031s

$ time for i in $(seq 1 10000); do echo "${A}" | tr '[:upper:]' '[:lower:]' > /dev/null; done

real 0m11.122s
user 0m11.535s
sys 0m4.022s

$ time for i in $(seq 1 10000); do echo "${A}" | awk '{print tolower($1)}' > /dev/null; done

real 0m21.186s
user 0m17.914s
sys 0m7.817s

but then:

$ time for i in $(seq 1 10000); do echo "${A}" ; done | tr '[:upper:]' '[:lower:]' > /dev/null

real 0m0.076s
user 0m0.067s
sys 0m0.047s

$ time for i in $(seq 1 10000); do echo "${A}" ; done | awk '{print tolower($1)}' > /dev/null

real 0m0.095s
user 0m0.090s
sys 0m0.061s

So, while variables manipulation is very useful in bash, and especially with a complex script with a lot of string manipulation it is a time-saver, for mass data editing when you don't do unnecessary exec()s, the results might be different.

Your "1000 servers" example is misleading btw, as they run in parallel. :)

Note, I'm all for reducing global warming.

Anonymous, February 11, 2022 at 5:05 AM
just for the sake of completeness, let's compare Cavendish with Gros Michel:

$ time echo | awk -v "a=${A}" 'END { for (i=0;i<10000;i++) {print tolower(a)}}' > /dev/null

real 0m0.015s
user 0m0.009s
sys 0m0.008s

Again, usage in scripts of variable manipulation is good, just your examples can be a bit misleading.