truss debugging

truss (on Linux: strace) is a really handy tool. According to it’s manual page:

The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs.

So it’s a good thing if you’re a bit familiar with system calls, which are made from a userspace programm to the kernel of your operation system. Ok, that sounds a little scary, but it is very heplfull, when you have a programm, which doesn’t work as expected, but doesn’t provide any useful information (even with increased loglevel). truss outputs the systemcall, it’s parameters (usually limited to about 20-30 characters, so it will fit on a single output line) and the return value of the system call.

Of course you don’t see the internal processing of an process (eg. for pure calculations there’s no need to use kernel functions), but very often in operation the problem isn’t the internal processing, but missing files, files in wrong directories, missing access rights and so on. These things cause problems quite often, and an experienced Unix administrator uses then truss. In most times the adminstrator just focusses on some kernel calls, which do I/O on the filesystem.

I use this tool quite often and I’m surprised, how much problems I found and resolved just by looking at the in- and output system calls; recently we had a hanging authoring cluster which doesn’t respond any more, even after having restarted both nodes. The logs showed nothing obvious, only truss revealed, that there’s was a “lock”-file, which probably was a left-over from a operation (it must have been a race-condition in the code, we haven’t found yet). And the cluster-sync of both processes waited for this lock-file to disappear, so they could write now theirselves. And of course it doesn’t, so the processes hung. So, truss showed me that these processes were repeatedly testing the existence of this file. I removed it and voila, the cluster sync went back to operation.

Using truss worked so often, that in most cases, where I/O is affected, I use truss first, even before I increase the loglevel. Colleagures also have these procedure when they have problems with their applications. We use the term “truss debugging” 😉

Oh, if you have performance issues with I/O, truss is also a nice tool. You can check, on what files I/O is performed, and with what read/write ratio. You can even see, what buffer size is used. About 3 years ago we filled a bug on CQ 3.5.5 because CQ read its byte by byte; Day provided a hotfix and the startup time went down from 1 hour to about 15 minutes. The solution was to read the in 2k blocks (or 4k, doesn’t matter as long it isn’t byte-wise); this issue was only noticable by either reading the source code or invoking truss.