So you have a FS filling up very fast on a server. Someone connects, finds some big trace files growing quickly and deletes them, which frees some space on the FS. A few minutes later, the FS is back to a dangerous threshold, but you are unable to see the trace files any more (hey, they have been deleted!), and a few moments later your FS is full, which is not what we want. Often, people panic and end up rebooting the application(s) and/or server(s) supposed to be filling these trace files -- which is also not what anyone wants, but this is often seen as the best (least bad?) way of "saving the system".
cat /dev/null > trace_file would obviously have been better here, but what is done is done and you have to deal with the situation: you cannot "unrm" the file. Having said that, in a specific case I faced, it was due to a 19c Oracle bug (what? an Oracle bug? yeah, it happens from time to time . . . :D) where the Oracle processes were tracing too much (you bet!) and only a very aggressive (every minute) cron job emptying these trace files could contain the issue (if this can be considered a fix), so the solution below still applies while you organize applying a patch (which would obviously more or less need a downtime . . .).
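As a sketch of such a stopgap cron job (the 100 MB threshold, the file patterns and the directory are assumptions -- adapt them to your diag destination), something like this truncates oversized trace files in place instead of deleting them:

```shell
# Hypothetical stopgap: truncate (never delete!) any *.trc/*.trm file over
# 100 MB so the FS cannot fill up while waiting for the patch window.
# The demo uses a throwaway directory; point TRACE_DIR at your real
# diag destination.
TRACE_DIR=$(mktemp -d)
truncate -s 200M "$TRACE_DIR/SID01_ora_1234.trc"   # fake oversized (sparse) trace file
find "$TRACE_DIR" -type f \( -name '*.trc' -o -name '*.trm' \) -size +100M \
  -exec sh -c ': > "$1"' _ {} \;
wc -c < "$TRACE_DIR/SID01_ora_1234.trc"            # prints 0: truncated, not deleted
```

Scheduled from cron every minute with an entry such as `* * * * * /usr/local/bin/truncate_traces.sh` (the script name is assumed).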
Back to our situation: df shows that the FS is full:
[root@linux01]# df -h /u01
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbOra1 237G  227G     0 100% /u01
[root@linux01]#

but not du (you can do the maths: according to du, the FS is far from being full):
[root@linux01]# du -sh *
144G    app
4.0K    crashfiles
16K     lost+found
612M    oracle.ExaWatcher
547M    tmp
[root@linux01]#

Before jumping into the solution, we need to understand how writing into a file works on Un*x systems and what file descriptors are.
Let's start with this simple ls example, which shows a file and, when redirected to /dev/null, shows nothing -- obviously, the output is redirected to /dev/null.
$ ls
a_file
$ ls > /dev/null
$

Now let's do the same as above but with an invalid option:
$ ls -z > /dev/null
ls: invalid option -- 'z'
Try 'ls --help' for more information.
$ ls -z > /dev/null 2>&1
$

Here, we can see that the error is not redirected to /dev/null when doing > /dev/null but only when doing > /dev/null 2>&1!? This is because > /dev/null redirects the "1" file descriptor (stdout -- your screen) to /dev/null but not the "2" file descriptor, which is stderr -- also your screen by default.
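Since redirections are processed left to right, their order matters too; a quick sketch (capturing the streams into variables just to make the difference visible):

```shell
# '> /dev/null 2>&1': fd 1 -> /dev/null, then fd 2 -> fd 1's new target: both silenced.
# '2>&1 > /dev/null': fd 2 -> fd 1's CURRENT target first, then fd 1 -> /dev/null:
# the error message escapes. ('|| true' only keeps the failing ls from aborting the snippet.)
quiet=$( { ls -z > /dev/null 2>&1; } 2>&1 || true )
noisy=$( ls -z 2>&1 > /dev/null || true )
echo "quiet='$quiet'"   # prints quiet=''
echo "noisy='$noisy'"   # prints the 'invalid option' error message
```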
While everyone more or less knows these special 1 and 2 file descriptors, you can also create and use any other file descriptor and read from or write to it; let's create a "3" file descriptor and write into it:
$ exec 3>fd3.log
$ echo "to fd3" >&3
$ cat fd3.log
to fd3
$

To simulate a process (an Oracle process or any other) writing into a file through its file descriptor, let's run this while true command inserting the current date every 5 seconds:
$ while true; do echo $(date)" to fd 3 again" >&3; sleep 5; done

The trace file is indeed updated every 5 seconds:
$ tail -f fd3.log
to fd3
Thu Mar 18 14:28:03 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:08 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:13 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:18 AEDT 2021 to fd 3 again

Now let's remove that file:
$ ls -ltr | grep fd
-rw-r--r-- 1 fred fred 535 Mar 18 14:28 fd3.log
$ date; rm fd3.log
Thu Mar 18 14:28:28 AEDT 2021   <== the date I removed the file
$ ls -ltr fd3.log
ls: cannot access 'fd3.log': No such file or directory   <== the file does not exist anymore
$

And if I go back to my session tailing the logfile:
$ tail -f fd3.log
Thu Mar 18 14:28:03 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:08 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:13 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:18 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:23 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:28 AEDT 2021 to fd 3 again   <== file has been deleted on Thu Mar 18 14:28:28 AEDT 2021
Thu Mar 18 14:28:33 AEDT 2021 to fd 3 again   <== still, things are inserted into it!
Thu Mar 18 14:28:38 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:43 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:48 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:53 AEDT 2021 to fd 3 again
Thu Mar 18 14:28:58 AEDT 2021 to fd 3 again
Thu Mar 18 14:29:03 AEDT 2021 to fd 3 again
Thu Mar 18 14:29:08 AEDT 2021 to fd 3 again

I can see that my file is still being filled by the while true loop I started!? This is because the file descriptor is still open, so the process is still writing into it -- but you cannot see the file, as its physical envelope (wow, this is becoming mystic! :D) does not exist any more -- this is the situation I was describing earlier.
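As an aside, as long as some process keeps the descriptor open, you can even get the "deleted" content back by copying it out of /proc; a self-contained sketch in the current shell ($$ is the shell's own PID):

```shell
exec 5>gone.log                     # a process opens a log file (fd 5)
echo "precious data" >&5
rm gone.log                         # the file is "deleted"...
cat /proc/$$/fd/5 > recovered.log   # ...but /proc still reaches its inode
cat recovered.log                   # prints: precious data
exec 5>&-                           # close the descriptor
rm recovered.log
```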
Indeed, the full process of writing into a file is:
1/ open the file (open a file descriptor)
2/ write into it (write into the file descriptor)
3/ close the file (close the file descriptor)

Now that we know this, and as my example has shown, it looks obvious that we have a (far) better (and easier and safer and better . . . name it . . .) solution than restarting the application or rebooting the database or server, right?
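The three steps above can be spelled out with an explicit file descriptor in the shell (exec N> opens, >&N writes, N>&- closes):

```shell
exec 4>steps_demo.log    # 1/ open: fd 4 now points at steps_demo.log
echo "hello fd 4" >&4    # 2/ write through the descriptor
exec 4>&-                # 3/ close the descriptor
cat steps_demo.log       # prints: hello fd 4
rm steps_demo.log
```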
Let's move further into the investigation (as you also most likely don't know which trace files have been removed) and see the correct solution later on.
First of all, we need to find the biggest deleted files thanks to lsof +aL1 (see man lsof):
[root@linux01 u01]# lsof +aL1 /u01 | sort -nk 7 | tail -10
oracle    263797 oracle 271w REG 252,2    4915555 0 1059311 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_263797.trm (deleted)
diskmon    25690 oracle 363w REG 252,2    5736792 0  942591 /u01/app/oracle/diag/crs/linux01/crs/trace/diskmon_68.trm (deleted)
oracle    316046 oracle  68w REG 252,2    8536422 0 1488777 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_316046.trc (deleted)
oracle    116391 oracle  90w REG 252,2    9324380 0 4873097 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_116391.trc (deleted)
oracle    116379 oracle  89w REG 252,2    9621504 0 4876403 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_116379.trc (deleted)
oracle    116340 oracle  86w REG 252,2    9664646 0 4873104 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_116340.trc (deleted)
diskmon    25690 oracle 362w REG 252,2   27203987 0  942489 /u01/app/oracle/diag/crs/linux01/crs/trace/diskmon_68.trc (deleted)
oracle    263797 oracle 270w REG 252,2   36228291 0 1059304 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_263797.trc (deleted)
ora_p02l_ 371508 oracle  49w REG 252,2  209219823 0 5767946 /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trm (deleted)
ora_p02l_ 371508 oracle  48w REG 252,2 3959405542 0 5767945 /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trc (deleted)
[root@linux01 u01]#

We can find from /proc the process ID and the file descriptors of these files:
[root@linux01 u01]# ls -ltr /proc/371508/fd | grep SID02
l-wx------ 1 oracle dba 64 Nov  4 21:32 49 -> /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trm (deleted)
l-wx------ 1 oracle dba 64 Nov  4 21:32 48 -> /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trc (deleted)
[root@linux01 u01]#

We then just have to apply the 3rd step of how writing into a file works: close these file descriptors! The process will then no longer write into them -- easy!
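To gauge how much space these deleted-but-open files are holding, you can sum lsof's SIZE/OFF column (field 7 in the output above) with awk; a sketch, demonstrated on two sample lines in the same format since lsof needs root on the live system:

```shell
# On the real system (as root):
#   lsof +aL1 /u01 | awk '{sum+=$7} END {printf "%.1f GiB\n", sum/2^30}'
# Demo on two sample lines shaped like the lsof output above:
printf '%s\n' \
  'oracle    263797 oracle 270w REG 252,2   36228291 0 1059304 /u01/app/oracle/diag/rdbms/db01/SID01/trace/SID01_ora_263797.trc (deleted)' \
  'ora_p02l_ 371508 oracle  48w REG 252,2 3959405542 0 5767945 /u01/app/oracle/diag/rdbms/db02/SID02/trace/SID02_p02l_371508.trc (deleted)' |
awk '{sum+=$7} END {printf "%.1f GiB\n", sum/2^30}'   # prints 3.7 GiB
```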
While this is easy to do in a program you write yourself, here we have to close a file descriptor inside another running PID (which is still doing tons of stuff we do not want to stop). For this purpose, we will be using gdb, the GNU debugger (available on any Linux):
[root@linux01 u01]# gdb -p 371508
. . .
(gdb) call close(48)
$1 = 0
(gdb) call close(49)
$2 = 0
(gdb) quit
A debugging session is active.
  Inferior 1 [process 371508] will be detached.
Quit anyway? (y or n) y
Detaching from program: /u01/app/oracle/product/19.0.0.0/db_1/bin/oracle, process 371508
[Inferior 1 (process 371508) detached]
[root@linux01 u01]#

We can verify that these file descriptors have indeed been closed:
[root@linux01 u01]# ls -ltr /proc/371508/fd | grep SID02
[root@linux01 u01]#
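If attaching gdb to a critical process feels too risky, a gentler alternative on Linux is to truncate the deleted file through its /proc entry: the space is reclaimed immediately while the descriptor stays open, so the process can keep writing. A self-contained sketch of the idea, simulated in the current shell ($$) rather than against the Oracle PID above:

```shell
exec 9>demo_trace.log            # a process opens a trace file (fd 9)
echo "lots of trace output" >&9
rm demo_trace.log                # file deleted, but fd 9 keeps the inode alive
: > /proc/$$/fd/9                # truncate via /proc: space is freed, fd stays open
wc -c < /proc/$$/fd/9            # prints 0
echo "still writable" >&9        # the process keeps working undisturbed
exec 9>&-
```

On the real system this would be `: > /proc/371508/fd/48` (run as root); unlike close(), the process never even notices.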
You have then properly closed these file descriptors: no more space issue on your system, problem fixed!