
Some bash tips -- 15 -- comm

This post is part of a list of shell tips I find useful in every script; the whole list can be found here.

Unix-like systems are full of text files, and we very often end up with 2 lists of stuff where we need to find which elements of the first list are not in the second list, or which elements are in both lists. The quickest way to achieve this is to use the comm command.
Let's use 2 files as an example; these are the users with uid < 10 I took from an OCI instance, from which I randomly removed some users for the sake of explaining comm.
[fred@onehost]$ cat list1
root
bin
adm
lp
sync
shutdown
halt
mail
[fred@onehost]$ cat list2
root
bin
daemon
adm
lp
halt
mail
[fred@onehost]$
Before we start comming the files, we need to know what is, AFAIK, the only comm requirement: the files have to be sorted. If they are not, you will get the error below (unless you specify --nocheck-order, but the output would then be wrong, so I am not sure this option is worth mentioning):
comm: file 2 is not in sorted order
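If you would rather detect the problem before calling comm, sort -c checks the order of a file without modifying it. The snippet below is only a sketch: the scratch directory, the file contents and the check_sorted helper are made up for the illustration.

```shell
# Illustrative scratch files; none of these names come from the post.
workdir=$(mktemp -d)
printf '%s\n' root bin adm lp > "$workdir/unsorted"
printf '%s\n' adm bin lp root > "$workdir/sorted"

# sort -c only checks the order: it exits non-zero at the first
# out-of-order line and never rewrites the file.
check_sorted() {
    if sort -c "$1" 2>/dev/null; then
        echo "sorted"
    else
        echo "not sorted"
    fi
}

unsorted_status=$(check_sorted "$workdir/unsorted")
sorted_status=$(check_sorted "$workdir/sorted")
echo "unsorted file: $unsorted_status"
echo "sorted file: $sorted_status"
rm -rf "$workdir"
```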
As a side note, the correct way to sort a file in place is:
[fred@onehost]$ sort -o list1 list1
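Redirecting the output of sort to the very file it is reading is NOT a good idea: the shell truncates list1 when it sets up the > redirection, usually before the file has been read, so you are likely to end up with an empty file. Do NOT do as below: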
[fred@onehost]$ cat list1 | sort > list1  <== do NOT do that
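For reference, here are the two safe ways I know of to sort a file in place, side by side. This is a sketch on throwaway files in a scratch directory (the paths are illustrative): sort -o works because sort reads all of its input before opening the output file, and the second form goes through a temporary file that only replaces the original once the sort has succeeded.

```shell
workdir=$(mktemp -d)
printf '%s\n' root bin adm lp > "$workdir/list1"
printf '%s\n' root bin adm lp > "$workdir/list1b"

# Option 1: -o may name one of the input files.
sort -o "$workdir/list1" "$workdir/list1"

# Option 2: sort into a temporary file, then replace the original
# only if sort succeeded.
sort "$workdir/list1b" > "$workdir/list1b.tmp" &&
    mv "$workdir/list1b.tmp" "$workdir/list1b"

result_o=$(cat "$workdir/list1")
result_tmp=$(cat "$workdir/list1b")
echo "$result_o"
rm -rf "$workdir"
```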
Now that our files are sorted, we can comm them; let's start with no option:
[fred@onehost]$ comm list1 list2
                adm
                bin
        daemon
                halt
                lp
                mail
                root
shutdown
sync
[fred@onehost]$
The default output shows 3 columns:
  • The first column contains the elements which are in the first file only
  • The second column contains the elements which are in the second file only
  • The third column contains the elements which are in both files

As outlined in the comment below, we can also sort the files on the fly with bash process substitution when executing the comm command (note that this won't modify the original files):
[fred@onehost]$ comm <(sort list1) <(sort list2)
                adm
                bin
        daemon
                halt
                lp
                mail
                root
shutdown
sync
[fred@onehost]$
You can also use sort -u to remove the duplicates while sorting the files:
[fred@onehost]$ comm <(sort -u list1) <(sort -u list2)
                adm
                bin
        daemon
                halt
                lp
                mail
                root
shutdown
sync
[fred@onehost]$
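Our two lists have no duplicates, so -u changes nothing here. To see why it can matter, here is a small sketch with made-up files a and b where root appears twice in the first one. It uses temporary files instead of <( ) so that it also runs under plain sh, and the -23 option (explained further down), which keeps only the lines unique to the first file.

```shell
workdir=$(mktemp -d)
# Hypothetical inputs: "root" is listed twice in the first file.
printf '%s\n' adm root root > "$workdir/a"
printf '%s\n' adm root > "$workdir/b"

# Duplicates kept: comm pairs one "root" with the second file and
# reports the extra one as unique to the first file.
with_dups=$(comm -23 "$workdir/a" "$workdir/b")

# Duplicates removed with sort -u: nothing is left over.
sort -u "$workdir/a" > "$workdir/a.uniq"
sort -u "$workdir/b" > "$workdir/b.uniq"
without_dups=$(comm -23 "$workdir/a.uniq" "$workdir/b.uniq")

echo "without -u: '$with_dups'"
echo "with -u: '$without_dups'"
rm -rf "$workdir"
```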
I am no big fan of this output, even if I can think of some uses for it. You can for example use the --total option to show a count for each column, and the --output-delimiter option, which you can for example set to a semicolon to get a CSV-like output to paste into a spreadsheet tool for a nice show-off to management:
[fred@onehost]$ comm list1 list2 --total --output-delimiter ";"
;;adm
;;bin
;daemon
;;halt
;;lp
;;mail
;;root
shutdown
sync
2;1;6;total
[fred@onehost]$
Now for what I consider to be the most useful comm options. Their use is a bit counterintuitive, as the options hide information instead of showing it (I did this kind of thing in rac-status; it is very powerful):
  • -1: do not show column 1
  • -2: do not show column 2
  • -3: do not show column 3
This for example means that -23 will:
  • hide column 2
  • hide column 3
=> then show only column 1 (the elements which are only in the first file); let's illustrate with an example:
[fred@onehost]$ comm -23 list1 list2
shutdown
sync
[fred@onehost]$
Following the same principle, -12 will only show the 3rd column, which contains the elements common to both files:
[fred@onehost]$ comm -12 list1 list2
adm
bin
halt
lp
mail
root
[fred@onehost]$
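The remaining two-option combination, -13, works the same way and leaves only the second column, i.e. the elements which are only in the second file. A minimal sketch that recreates the two sorted lists in a scratch directory (the paths are illustrative):

```shell
workdir=$(mktemp -d)
# Same content as list1 and list2 above, already sorted.
printf '%s\n' adm bin halt lp mail root shutdown sync > "$workdir/list1"
printf '%s\n' adm bin daemon halt lp mail root > "$workdir/list2"

# -13 hides columns 1 and 3, so only lines unique to list2 remain.
only_in_second=$(comm -13 "$workdir/list1" "$workdir/list2")
echo "$only_in_second"
rm -rf "$workdir"
```

With our data this prints daemon, the only user I did not remove from list2.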
And yes, -123 would show nothing at all:)
[fred@onehost]$ comm -123 list1 list2
[fred@onehost]$
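These filters also give a way to get the per-column counts on a system whose comm does not know --total (neither --total nor --output-delimiter is part of POSIX comm). The following is a sketch under that assumption; the count_lines helper and the scratch paths are made up for the example.

```shell
workdir=$(mktemp -d)
printf '%s\n' adm bin halt lp mail root shutdown sync > "$workdir/list1"
printf '%s\n' adm bin daemon halt lp mail root > "$workdir/list2"

# Hypothetical helper: count lines on stdin, stripping the padding
# that some wc implementations add.
count_lines() { wc -l | tr -d '[:space:]'; }

only_first=$(comm -23 "$workdir/list1" "$workdir/list2" | count_lines)
only_second=$(comm -13 "$workdir/list1" "$workdir/list2" | count_lines)
in_both=$(comm -12 "$workdir/list1" "$workdir/list2" | count_lines)

totals="$only_first;$only_second;$in_both;total"
echo "$totals"
rm -rf "$workdir"
```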
comm is also very fast on very big files; see how easy it is?
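Since the filtered output is one element per line, it plugs straight into a loop in a script. Below is a sketch, assuming two hypothetical files named expected and actual that hold the same data as list1 and list2; the message text is just an example.

```shell
workdir=$(mktemp -d)
# Hypothetical file names; same content as list1 and list2.
printf '%s\n' adm bin halt lp mail root shutdown sync > "$workdir/expected"
printf '%s\n' adm bin daemon halt lp mail root > "$workdir/actual"

# Act on every element that is in "expected" but not in "actual".
report=$(
    comm -23 "$workdir/expected" "$workdir/actual" |
    while IFS= read -r user; do
        echo "missing user: $user"
    done
)
echo "$report"
rm -rf "$workdir"
```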

Now that you know comm, think back to how you were doing this before... Yeah, no: there is no coming back from comm, comm is awesome!



1 comment:

  1. in bash: comm <(sort list1) <(sort list2)
