Basic performance tuning: Caching

November 4, 2009 by jhoh228

Many CQ installations I’ve seen start with the default configuration of CQ. This is in fact a good decision, because the default configuration handles small and medium-sized installations very well. In addition, you don’t have to maintain a bunch of configuration files and settings, and most CQ hotfixes (which are delivered without full QA) are only tested against default installations.

So when you start your project with a pristine CQ installation, the performance of both the publishing and the authoring instances is usually very good: the UI is responsive and page load times are in the two-digit millisecond range. Great. Excellent.

When your site grows and the content authors start their work, you need to run your first performance and stress tests using the numbers given in the requirements (“the site must be able to handle 10000 concurrent requests per second with a maximum response time of 2 seconds”). You can either meet such requirements by throwing hardware at the problem (“we must use 6 publishers, each on a 4-core machine”) or you can try to optimize your site. Okay, let’s try optimization first.

Caching is the first thing that comes to mind. You can cache on several layers: on the application level (caches built into the application, like the output cache of CQ 3 and 4), in the dispatcher cache (as described elsewhere in this blog), or on the user’s system (the browser cache). Each cache layer should decrease the number of requests reaching the layers behind it, so that in the end only those requests get through which cannot be answered from a cache and must be processed by CQ. Our goal is to move files into the cache nearest to the end user; loading them from there is faster than loading them from a location 20 000 kilometers away.

(A system engineer may also be interested in that solution, because it will offload data traffic from the internet connection. Leaves more capacity for other interesting things …)

If you start from scratch with performance tuning, going for the low-hanging fruit is the way to go. So you start an iterative process, which consists of the following steps:

  1. Identify requests which can be handled by a caching layer that is placed nearer to the end user.
  2. Identify actions which allow these requests to be cached in a cache next to the user.
  3. Perform these actions.
  4. Measure the results using appropriate tools.
  5. Start over from (1).

(For a broader view of performance tuning, see David Nuescheler’s post on the Day developer site.)

As an example I will go through this cycle on the authoring system. I start with a random look at the request.log, which may look like this:

09/Oct/2009:09:08:03 +0200 [8] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:06 +0200 [8] <- 200 text/html; charset=utf-8 3016ms
09/Oct/2009:09:08:12 +0200 [9] -> GET / HTTP/1.1
09/Oct/2009:09:08:12 +0200 [9] <- 302 - 29ms
09/Oct/2009:09:08:12 +0200 [10] -> GET /index.html HTTP/1.1
09/Oct/2009:09:08:12 +0200 [10] <- 302 - 2ms
09/Oct/2009:09:08:12 +0200 [11] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:13 +0200 [11] <- 200 text/html; charset=utf-8 826ms
09/Oct/2009:09:08:13 +0200 [12] -> GET /libs/wcm/welcome/resources/welcome.css HTTP/1.1
09/Oct/2009:09:08:13 +0200 [12] <- 200 text/css 4ms
09/Oct/2009:09:08:13 +0200 [13] -> GET /libs/wcm/welcome/resources/ico_siteadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [14] -> GET /libs/wcm/welcome/resources/ico_misc.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] -> GET /libs/wcm/welcome/resources/ico_useradmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] <- 200 image/png 8ms
09/Oct/2009:09:08:13 +0200 [16] -> GET /libs/wcm/welcome/resources/ico_damadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [16] <- 200 image/png 5ms
09/Oct/2009:09:08:13 +0200 [13] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [14] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [17] -> GET /libs/wcm/welcome/resources/welcome_bground.gif HTTP/1.1
09/Oct/2009:09:08:13 +0200 [17] <- 200 image/gif 3ms

OK, it looks like some of these requests need not be handled by CQ at all: the PNG files and the CSS files. These files hardly ever change (maybe on a deployment or when a hotfix is installed). For the daily work of a content author they can be assumed to be static, but of course we must make sure that authors fetch a new version when one of them is updated. OK, that was step 1: we want to cache the PNG and CSS files which are placed below /libs.
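Step 1 can also be done with a small script instead of by eye. The following is only a sketch in Python: the two regular expressions are derived from the request.log excerpt above, and the function name, the path prefix and the list of MIME types are my own assumptions for illustration.

```python
import re
from collections import Counter

# Line formats as seen in the request.log excerpt above.
REQUEST = re.compile(r"\[(\d+)\] -> (\w+) (\S+) HTTP/")
RESPONSE = re.compile(r"\[(\d+)\] <- (\d{3}) (.+?) (\d+)ms$")

def static_candidates(lines, prefix="/libs/", types=("image/png", "text/css")):
    """Count requests below `prefix` that CQ answered with one of `types`."""
    pending = {}      # request id -> requested path
    hits = Counter()  # path -> number of times CQ had to serve it
    for line in lines:
        line = line.strip()
        m = REQUEST.search(line)
        if m:
            pending[m.group(1)] = m.group(3)
            continue
        m = RESPONSE.search(line)
        if m:
            path = pending.pop(m.group(1), None)
            content_type = m.group(3).split(";")[0].strip()
            if path and path.startswith(prefix) and content_type in types:
                hits[path] += 1
    return hits
```

Feeding it the lines of a request.log (for example via `open("request.log")`) returns a counter of the files that are worth caching; the ones with the highest counts are the best candidates.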

Step 2: How can we cache these files? We don’t want to cache them within CQ (that wouldn’t bring any improvement), so the dispatcher cache and the browser cache remain. In this case I recommend caching them in the browser cache, for 2 reasons:

  • These files are requested more than once during a typical authoring session, so it makes sense to cache them directly in the browser.
  • The latency of the browser cache is far lower than the latency of any load over the network.

An additional restriction speaks against the dispatcher:

  • There are no flushing agents for authoring mode, so we cannot use the dispatcher that easily. So when tuning an authoring instance, the dispatcher cache is not an option.

To make changes to these files on the server visible to the user, we can use the expiration feature of HTTP. It allows us to specify a time-to-live, which tells any interested party how long we consider the file up-to-date. When this time has passed, every party that cached the file should consider it stale and fetch it again.
This isn’t the perfect solution, because a browser will refetch the file from time to time even though the cached copy is still valid and up-to-date.
But it is still an improvement if the browser fetches these files once an hour instead of twice a minute (on every page load).
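To get a feeling for the numbers, here is a deliberately simplified model in Python. It assumes one author, one static file, and a browser that strictly honours the time-to-live; conditional requests and manual reloads are ignored, and the function name is made up for this sketch.

```python
def server_fetches(page_load_times, ttl_seconds):
    """Count how often one static file is fetched from the server.

    page_load_times: ascending timestamps (in seconds) of page loads
    ttl_seconds:     lifetime granted to the cached copy on each fetch
    """
    fetches = 0
    expires_at = None
    for t in page_load_times:
        if expires_at is None or t >= expires_at:
            fetches += 1                  # cache empty or copy expired
            expires_at = t + ttl_seconds  # lifetime counts from the access
    return fetches

# One page load every 30 seconds over an 8-hour working day.
loads = range(0, 8 * 3600, 30)
print(server_fetches(loads, 0))     # no caching: 960 fetches
print(server_fetches(loads, 3600))  # one hour time-to-live: 8 fetches
```

Under these assumptions a single file causes 8 requests per day instead of 960; multiply that by the number of static files per page and the number of authors.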

Our forecast is that the browser of an authoring user will issue far fewer requests for these files. This improves the rendering performance of the page (the files come from the fast browser cache instead of the server), and the load on CQ decreases, because it doesn’t need to handle that many requests. Good for all parties.

Step 3: We implement this in the Apache web server which we have placed in front of our CQ authoring system, by adding the following statements:

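<LocationMatch "^/libs/">
  # mod_expires must be loaded and switched on, otherwise the rules below have no effect
  ExpiresActive On
  ExpiresByType image/png "access plus 1 hour"
  ExpiresByType text/css "access plus 1 hour"
</LocationMatch>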

Instead of relying on file extensions, these rules specify the expiration by MIME type. The files are considered up-to-date for an hour after they were accessed, so the browser will reload them at most once an hour. This value should also be OK if one of these files changes. And if all else fails, the authoring users can clear their browser cache.

Step 4: We measure the effect of our changes using 2 different strategies. First we look at the request.log again and check whether these requests still show up. If the server is already heavily loaded, we can additionally check for a decreasing load and improved response times for the remaining requests. As a second option we take a simple use case of an authoring user and run it with Firefox’s Firebug extension enabled. This plugin visualizes how and when the parts of a page are loaded and displays the response times quite exactly. You should now see that the number of files requested over the network has decreased and that the page and all its embedded objects load faster than before.
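A third, quick check is to look at the response headers of one of the static files and see which lifetime the server really announces. The sketch below assumes you already have the headers as a dictionary (for example copied from Firebug); the function name and the header values in the usage example are made up.

```python
from email.utils import parsedate_to_datetime

def ttl_from_headers(headers):
    """Return the lifetime in seconds announced via Date and Expires, or None."""
    if "Date" not in headers or "Expires" not in headers:
        return None  # no expiration announced by this response
    date = parsedate_to_datetime(headers["Date"])
    expires = parsedate_to_datetime(headers["Expires"])
    return int((expires - date).total_seconds())

# Hypothetical headers of a PNG below /libs after the change
headers = {
    "Date": "Fri, 09 Oct 2009 07:08:13 GMT",
    "Expires": "Fri, 09 Oct 2009 08:08:13 GMT",
    "Content-Type": "image/png",
}
print(ttl_from_headers(headers))  # 3600
```

If the function returns None for a file you expected to be cached, the expiration rule did not match that response.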

So with a quick and easy-to-perform action you have decreased the page load times. When I added expiration headers to a number of static images, JavaScript and CSS files on a publishing instance, the number of requests which went over the wire went down to 50% and the page load times also decreased, so that even during a stress test the site still performed well. Of course, dynamic parts must be handled by their respective systems, but if we can offload requests from CQ, we should do so.

So as a conclusion: some very basic changes to the system (a few adjustments to the Apache configuration) may increase the speed of your site (publishing and authoring) dramatically. Changes like the ones described are not invasive and are highly adjustable to the specific needs and requirements of your application.

Reporting application problems

October 28, 2009 by jhoh228

(I write this blog article with a certain background: in the Daycare ticketing system, support often needs to ask for additional information before it can start an initial analysis of an issue report. This is time-consuming and increases the time-to-fix. So this is meant as a help for my colleagues at Day support, but it should also be applicable to most enterprise support lines.)

Writing a bug report is hard work. Many issue reporters think that the people responsible for the application just try to refuse to fix a bug, and therefore ask questions and demand information which is hard to deliver or which seems absolutely clear to the reporter; that these people don’t want to admit that their product has issues.

But these questions can often be easily answered, if you are well prepared.

Usually developers (and the support people, who work as first line for them) ask the following questions:

What software are you using, which versions, which additional fixes?
People often assume that this information is already known to support (especially if you’re dealing with an enterprise support line); but these support lines often don’t track the software versions of their customers. And who knows, maybe you are reporting an issue with a new version which is currently only installed on your development systems.

Providing this information by default when opening an issue report helps support to help you more quickly. There’s no need to go back and forth about version information and installed hotfixes. That is at least one round of question and answer less.

A point for all software developers: provide a facility to get all this information without wrestling with the package database or registry of your system. Keep this information up-to-date automatically when additional packages, fixes or enhancements are installed.

What’s the impact of the reported issue?
Describe the impact of the problem, so support can estimate the importance of the issue. A report on a wrongly documented feature gets a different priority on the developers’ to-do list than an ERP system that crashes every hour.
Also name the audience which is affected by the issue. A non-working feature which is offered as a vital part of your website is clearly more important than the same feature being non-functional for a small group of people, because for the latter it is probably easier to provide a workaround.
(It will probably be super-important if this small group is the management, but that’s another topic …)

When was the issue spotted first?
This information may help to correlate the issue with other events; often problems become visible only under certain circumstances which were not present at the start. This may be a system update (operating system, JVM, database, …), changed settings in the application itself or just heavier use of the system (more data, more users, higher peak usage). All these factors increase the chance that so far unknown and unnoticed problems become visible and harm your application.
(That’s the background of the famous quote “never change a running system”.)

It’s your task as an issue reporter to provide this information to support. It helps support to focus on the impact of such changes, which very often reduces the amount of investigation dramatically.

So for example, if you have recently updated your Sun JVM from 1.5.0.8 to 1.5.0.11 for security reasons and suddenly encounter sporadic crashes, the developers may focus on the changes introduced by this JVM update and analyse their impact on the application. Without this information you will probably have to go through a long and painful analysis phase, in which developers ask for all kinds of dumps, JVM instrumentation and so on.

Is the problem reproducible?
This is the question a developer is most interested in. If an issue can be reproduced, it can be fixed, because the developer can analyze the issue, understand the problem and then solve it, all without too much trial and error just to see the problem.
If an issue cannot be reproduced, a lot of information is missing. Maybe the problem occurs only under conditions which exist on your particular system or with your particular data. Trying to reproduce the issue on any other system is then hard or impossible.
So this is one of the most important tasks of an issue reporter: trying to provide a reproducible test case. If you are able to reproduce the issue, describe all the prerequisites and the steps to actually reproduce it. Be it a step-by-step description or a little screencast, any appropriate format is welcome.

For Day CQ, the playground/geometrixx application of a plain CQ installation can be used as the basis for test cases. So just install a plain CQ and make as few changes as possible to reproduce the problem.

If you can reproduce your problem on a plain CQ installation, you make fixing your issue much easier for Day. Time-consuming analysis and assumptions about a lot of parameters can be avoided, and the developers can head directly for the issue itself.

Often you cannot reproduce the issue you want to report, either because you don’t know the issue exactly (“my system just crashes”) or because it’s specific to a certain environment (“the crash only happens under heavy load; we couldn’t reproduce this crash using stress tests yet”). Then you need to provide as much information as possible.

Additional information
Attach all available information to your issue report (OK, not really _all_ information; only what sounds useful, e.g. log files containing application-specific logs, system dumps, thread dumps for Java applications, …).

If some special information is missing, support will ask for it. But if you provide a certain standard set of information (depending on your application), this will be sufficient in 90% of cases.

For a Day CQ installation this information is the following:

  • error.log of CQ
  • error.log of CRX
  • in case of performance problems: request.log, garbage collection log
  • in case of performance problems and system lockups: threaddumps
  • in case of performance problems and out-of-memory-exceptions: threaddumps, heapdumps

Conclusion

For all these questions there are good reasons why they are asked. I hope I have shown you some of the background needed to understand these reasons.

Providing the right information from the start reduces the time until you get a response from support which actually helps you to resolve your issue, or which at least offers useful tips for establishing a workaround. So in the long term it helps both you as an issue reporter and the support.

A good issue report

October 16, 2009 by jhoh228

(inspired by the How To Ask Questions The Smart Way by Eric S Raymond)

In the last months I have repeatedly been in situations where people brought up issues like “the site isn’t working” or “the site is slow”, without any further information. If I am responsible for the thing in question, this leaves me with a hard task: I am expected to fix a problem about which I don’t have any information. Sometimes the problem doesn’t even exist on my site, but on the system of the person reporting it or somewhere in between. But we don’t know.

People who get such reports tend to follow different strategies. One strategy is to reject such issues because “I cannot reproduce it on my system; the website is responding very quickly and behaves fine” and to throw them away (move the complaint email to the trash or close the ticket in the issue system). Others just ignore them (for the same reason), but don’t throw the report into the trash. A third approach is to request more information on the issue. That would be a good approach, but either the reporters don’t react anymore (because they are busy), the problem is gone in the meantime (for whatever reason), or they can no longer provide the requested information. Also not a satisfying solution.

So the only remaining solution is to make people provide the required information with the initial issue report. If you ask people to describe their problem closely and in detail, they like to provide this information, because they feel that somebody really wants to solve their problem. But because they are not aware that (and which) information is needed to resolve an issue, they tend to provide no information at all.

So the goal is to have a list of things, which have to be provided by the reporter of an issue along with the issue. So in the end an issue report would look like this:

“At about 1:40pm on September 23rd 2009 I requested the page http://www.abc.foo/a/b.html; I received only a damaged page, with some pictures missing. When I tried to log in using my credentials (my username is hbt85), I received an internal server error page; I used the Firefox browser installed on my corporate computer.”

Although this report is very brief (and probably doesn’t take much time to write), it contains valuable information, so a system administrator can immediately start to look for the problem and has a realistic chance of finding traces of the described issue (in the log files, in a system dump or in the application monitoring). That is because it contains the following important information:

  • Who is the reporter? (Not only the email address or the full name of the user, but also the username in the affected system)
  • What system is affected? (given by the exact hostname)
  • When does the issue occur? (time and date)
  • What has the reporter done and what were the effects of it?
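If issue reports come in through a form or a script, the four questions above can be enforced mechanically. The following Python lines are only a sketch; the field names and the function are invented for illustration and do not belong to any real ticketing system.

```python
# Hypothetical field names mapped to the question each one answers.
REQUIRED = {
    "reporter": "Who is the reporter (including the username in the affected system)?",
    "system": "What system is affected (exact hostname or URL)?",
    "time": "When did the issue occur (date and time)?",
    "actions": "What has the reporter done and what were the effects?",
}

def missing_information(report):
    """Return the questions which a report (a dict) leaves unanswered."""
    return [question for field, question in REQUIRED.items()
            if not str(report.get(field, "")).strip()]

report = {
    "reporter": "hbt85",
    "system": "http://www.abc.foo/a/b.html",
    "time": "",
    "actions": "requested the page, received a damaged page",
}
for question in missing_information(report):
    print("Please add:", question)
```

A report is only accepted when the returned list is empty, so the questions get asked before the ticket is opened and not days later.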

So the most time-consuming task of every support team is to qualify every incoming request to such a level that a qualified person can take a look at the system to identify and fix the issue. While qualifying such problems it very often turns out that the cause lies on the user side: missing training on the system resulting in wrong usage, missing or incorrect documentation, or just plain errors on the user’s side. In our example it could be that the user just used a wrong URL; he should have used “http://www.abc.foo/a-new/b.html”, but he missed the mail announcing this change.

So a major job of every support organisation is to have a prepared, up-to-date list of the information needed to resolve an issue. If you are in a support organisation, provide a list of the pieces of information which must be supplied by an issue reporter. And if you’re an experienced user who wants to report an issue and wants it really fixed, provide as much information as possible. There is no “too much information” regarding a specific problem.

Meta: new Job, Ignite 2009

October 11, 2009 by jhoh228

Although times seem to be hard with the financial crisis, which hit many companies and decreased IT budgets, I decided to leave my old job. Since October 1st I have been employed as a Solution Architect at Day Software and will be working in Frankfurt am Main (Germany). At the moment I am in Basel at the Barfüsserplatz office, getting to know the people (hey, you really rock!) and the products. So I will also be able to cover the latest versions in this blog.

I will also visit the customer summit this year in Zurich (only Thursday), meeting people and learning about our products :-) So see you there!

Traversing the content hierarchy

July 28, 2009 by jhoh228

When you play around with websites, you often get a good feeling for how much “engineering” work and thought has been put into a site. Think of things like SQL injection and shell escape injection, which were a problem a few years ago. If you encounter a site today which is vulnerable to such a problem, it’s either a problem of the budget (which is a lame excuse, because modern frameworks avoid such problems) or of the skill of the developer. An up-to-date site isn’t affected by such problems anymore.

A problem, which is clearly visible, but often not known to application developers and architects is the content structure, which is exposed to the user by the URL. Consider the following small example for a simple content structure:

/content/brand/en/home

for the startpage of a CQ-based website. The “home”-handle is the startpage and is thus called via

www.example.com/content/brand/en/home.html

So, where’s the problem? Well, most templates provide a kind of HTML representation of their content. So let’s try

www.example.com/content/brand/en.html

Maybe a structure handle such as the language node (which is often just used to differentiate between languages) also provides an HTML representation of its content; it could, for example, just render its child nodes as a bulleted list.

So what? Does it do any harm if you reveal that you provide Chinese content besides English? No, most of the time it doesn’t. But what if you already have fresh content ready which is not yet linked? It would appear in such a list. Or what if you have “hidden content”, functions which are known only to a small group of people, things which aren’t secured by authentication and authorization? Suddenly someone has found your private data and could make use of it.

The trash can for functionality is often a folder named “tools”; developers tend to place everything there which doesn’t fit well into any other category. So you can find there contact forms, search functionality and other stuff. So what happens if you call

www.example.com/content/brand/en/home/tools.html

Does it also show your unused/crappy/new functions, which aren’t used in the website but are still there? Because for convenience some developer thought it would be cool to have all tools listed without major hassle (1 bookmark instead of 10)? Bad idea: you have just shown all your available tools to someone who shouldn’t see them.

So check your site and make sure that structure nodes, which are only used to structure your content, cannot be rendered at all or don’t reveal any information which could be useful for an attacker. Either return an empty page or (suggested) return the HTTP status code “403” (Forbidden). Don’t reveal data when it isn’t necessary. A well-engineered site also takes care of such “attacks” and doesn’t reveal any data which could be of use to a potential attacker.
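To run such a check systematically, you can derive all ancestor URLs of a known page and request each of them. The Python sketch below only builds the list of URLs; the function name is made up, and it assumes the simple “path plus .html” scheme used in the examples above.

```python
from urllib.parse import urlsplit, urlunsplit

def ancestor_urls(url, extension=".html"):
    """List the URLs of all ancestors of a page, nearest ancestor first."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith(extension):
        path = path[: -len(extension)]
    segments = [s for s in path.split("/") if s]
    urls = []
    # Cut off one path segment at a time, down to the top-level node.
    for depth in range(len(segments) - 1, 0, -1):
        parent = "/" + "/".join(segments[:depth]) + extension
        urls.append(urlunsplit((parts.scheme, parts.netloc, parent, "", "")))
    return urls

for candidate in ancestor_urls("http://www.example.com/content/brand/en/home.html"):
    print(candidate)
# http://www.example.com/content/brand/en.html
# http://www.example.com/content/brand.html
# http://www.example.com/content.html
```

Each of these URLs should answer with an empty page or a 403; anything that renders a list of child pages deserves a closer look.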

I’ve already done such tests on several CQ-based websites and found (besides some other things) a monitoring page (containing version information of the libraries used) and also a hidden web special which was dedicated to a member of the web team who was leaving for another location (hi, Katrin!). All of this information was publicly viewable (on a major corporate website!) just by playing around with path names and then following links.

Disk usage

July 6, 2009 by jhoh228

Providing enough free disk space is one of the most urgent tasks for every system administrator. Applications often behave strangely when a disk runs out of free space, and sometimes data gets corrupted if the application is badly written.

Under Unix the 2 tools to determine the usage of the disk are “du” (disk usage) and “df” (disk free). With “du” you can determine the amount of space a certain directory (with the files and subdirectories in it) consumes. “df” shows on a global level the amount of used and free disk space per partition.

Now assume that you pass the same directory (which is also a mount point) to both “du” and “df”. This directory contains a full CQ application with content and versions. You will probably get different results. When we did it, “df” showed about 570 gigabytes of used disk space, but “du” claimed that only 480 gigabytes were used. We checked for hidden directories, open files, snapshots and other things, but the difference of about 90 gigabytes remained.

This phenomenon can be explained quite easily. “du” accumulates the size of files. So if a file is 120 bytes in size, it adds 120 to the sum. “df” behaves differently: it counts by blocks, which are the smallest allocation unit of a Unix filesystem (today most blocks are 512 bytes large by default). Because a block can hold the data of only one file, the 120-byte file uses a full block, leaving 392 bytes unused in that block.

Usually this behaviour is not apparent, because the number of files is usually rather small (a few thousand to some ten thousand) and the files are large, so the unused parts are at most 1 percent of the total file size. But if you have a CQ contentbus with several hundred thousand files (content + versions), many of them small, this part can grow to a level where you’d ask your system administrator where the storage has gone.
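The arithmetic is easy to replay. The following Python sketch uses the 512-byte block size and the 120-byte file from the explanation above; the function names and the number of 300 000 files are assumptions for illustration, not measurements from our system.

```python
def allocated_bytes(file_size, block_size=512):
    """Space a file occupies if it always uses whole blocks."""
    blocks = (file_size + block_size - 1) // block_size  # round up
    return blocks * block_size

def slack(file_sizes, block_size=512):
    """Bytes allocated but unused, summed over all files."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)

print(allocated_bytes(120))    # 512: the 120-byte file occupies a full block
print(slack([120]))            # 392 bytes of that block stay unused
print(slack([120] * 300_000))  # 117600000 bytes of slack for 300 000 such files
```

With larger blocks the effect grows accordingly; calling `slack(sizes, block_size=4096)` on the real list of file sizes shows how much space would be lost on a filesystem with 4k blocks.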

So dear system administrator, there’s no need to move your private backup off the server, just tell the business, that their unique content structure needs a lot of additional disk space. :-)

User administration on multi-client-installations

June 15, 2009 by jhoh228

Developing an application for a multi-client installation isn’t only a technical or engineering quest; it also raises some questions which affect administration and organisational processes.

To ease administration, the user accounts in CQ are often organized in a hierarchy, so that users which are placed higher in the hierarchy can administer the users below them in the hierarchy tree. Using this mechanism, an administrator can easily delegate the administration of certain users to other users, which can then do administrative work for “their” users.

The problem arises when a user needs rights in 2 applications within the same CQ instance and every application should have its own “application administrator” (a child node of the superuser). Then this kind of administration is no longer possible, because a tree cannot model a hierarchy in which application administrators A and B have no parent or child relation to each other and yet both are placed higher in the hierarchy than the same user C.

I assume that creating several accounts for the same person, one per application, isn’t feasible. That would be the easiest solution from an engineering point of view, but it contradicts the ongoing move away from a separate user/password pair for each application and each user (single sign-on).

This problem puts the burden of user administration (e.g. assigning users to groups, resetting passwords) on the superuser, because the superuser is the only user which is always (either directly or transitively) a parent of any user. (A non-CQ-based solution would be to handle user-related changes like password set/reset and group assignment outside of CQ and then synchronize this data into CQ, e.g. by using a directory system based on LDAP.)

ACLs, access to templates and workflows should be assigned only using groups and roles, because these can be created per application. So if an application currently is based on a user hierarchy and individual user rights it’s hard to add a new application using the same user.

So one must make sure, that all assignments are only based on groups and roles, which are created per application. Assigning individual rights to a single user isn’t the way to go.

Being a good citizen in multi-client-installations

June 5, 2009 by jhoh228

Working in a group requires a kind of discipline which some people are not used to. I remember a colleague who always complained about his office mate, who used to shout into the phone even during normal conversations. If people work together and share resources, everyone is expected to be cooperative and not to trash the work and morale of their team members.

The same applies to applications: if they share resources (because they run on the same machine), they should release resources which are no longer needed, and should only claim resources when they are really needed. Consuming all CPU because “the developer was too lazy to develop a decent algorithm and just chose the brute-force solution” isn’t considered good behaviour. It’s even harder if these applications are contained within one process, so that a process crash not only affects application A, but also B and C. And then it doesn’t matter that these are well thought out and perfectly developed.

So if you plan to deploy more than one application to a single CQ instance, you should make sure that the developers are aware of this condition and have kept it in mind. The application no longer controls the heap usage on its own (on top of the heap consumption of CQ itself), but must share the heap with other applications. It must be programmed with stability and robustness in mind, because unknown structures and services may change assumptions about timing and sizes. And yes, a restart also affects the others. So restarting because of application problems isn’t a good thing.

In general an application should never require a restart of the whole JVM; a restart should only be acceptable for necessary changes to JVM parameters. All other application-specific settings should be changeable through special configuration templates which are evaluated at runtime, so that changes are picked up immediately. This even reduces the amount of work for the system administrator, because changing such values can be delegated to special users using the ACL mechanism.

CQ and multi-client capability

June 2, 2009 by jhoh228

In large companies there’s a need for a WCMS like Day CQ in many areas; of course for the company webpage, but also for some divisions offering services to internal or external customers, which either implement their application on top of Day CQ, use CQ as a proxy for other backend systems or just need a system for providing the online help.

Some of these systems are rather small, but nevertheless the business needs a system where authors can create and modify content according to the business’ needs. If you already have CQ and the knowledge to create and operate applications, it’s quite natural to use it to satisfy these needs. The problem is that building and operating a CQ environment for only one application isn’t cheap. Very often one asks whether CQ is capable of hosting several applications within one CQ installation to leverage economies of scale. CQ is able to handle hundreds of thousands of content handles per instance, so it’s able to host 5 applications with 20 thousand handles each, isn’t it?

Well, that’s a good question. By design CQ is able to host multiple applications, enforce content separation through the ACL concept and limit the access to templates for users. Sounds good. But, as always, there are problematic details.

I will cover these problems and ways to their resolution in the next posts.

truss debugging

May 26, 2009 by jhoh228

truss (on Linux: strace) is a really handy tool. According to its manual page:

The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs.

So it’s a good thing if you’re a bit familiar with system calls, which are made from a userspace program to the kernel of your operating system. OK, that sounds a little scary, but it is very helpful when you have a program which doesn’t work as expected but doesn’t provide any useful information (even with an increased log level). truss outputs the system call, its parameters (usually limited to about 20-30 characters, so it will fit on a single output line) and the return value of the system call.

Of course you don’t see the internal processing of a process (e.g. for pure calculations there’s no need to use kernel functions), but very often in operations the problem isn’t the internal processing, but missing files, files in wrong directories, missing access rights and so on. These things cause problems quite often, and an experienced Unix administrator then reaches for truss. Most of the time the administrator just focuses on the kernel calls which do I/O on the filesystem.

I use this tool quite often and I’m surprised how many problems I have found and resolved just by looking at the input and output system calls. Recently we had a hanging authoring cluster which didn’t respond any more, even after both nodes had been restarted. The logs showed nothing obvious; only truss revealed that there was a “lock” file, which was probably a left-over from an earlier operation (it must have been a race condition in the code which we haven’t found yet). The cluster sync of both processes waited for this lock file to disappear, so that they could write one themselves. Of course it never did, so the processes hung. truss showed me that these processes were repeatedly testing for the existence of this file. I removed it and voilà, the cluster sync went back to work.
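When the trace gets long, such a loop is easier to spot with a few lines of Python than by scrolling. The sketch below assumes trace lines of the common form `call("path", ...) = result`, as truss and strace print them for file-related calls; the function name, the threshold and the paths in the usage example are invented.

```python
import re
from collections import Counter

# Matches e.g.  stat("/some/path", 0xFFBFE8A0) = 0  (also with a pid prefix)
CALL = re.compile(r'\b(\w+)\("([^"]+)"')

def repeated_paths(trace_lines, min_count=3):
    """Return (call, path, count) for paths touched at least min_count times."""
    counts = Counter()
    for line in trace_lines:
        m = CALL.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return [(call, path, n)
            for (call, path), n in counts.most_common()
            if n >= min_count]

# Invented trace excerpt, only to show the input format
trace = [
    'stat("/opt/cq/shared/lock", 0xFFBFE8A0) = 0',
    'stat("/opt/cq/shared/lock", 0xFFBFE8A0) = 0',
    'open("/opt/cq/logs/error.log", O_WRONLY|O_APPEND) = 7',
    'stat("/opt/cq/shared/lock", 0xFFBFE8A0) = 0',
]
print(repeated_paths(trace))  # [('stat', '/opt/cq/shared/lock', 3)]
```

A file which is tested again and again without anything else happening in between is a strong hint that the process is waiting for it to appear or to disappear.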

Using truss has worked so often that in most cases where I/O is involved I use truss first, even before I increase the log level. Colleagues also follow this procedure when they have problems with their applications. We use the term “truss debugging” ;-)

Oh, if you have performance issues with I/O, truss is also a nice tool. You can check on which files I/O is performed, and with what read/write ratio. You can even see what buffer size is used. About 3 years ago we filed a bug on CQ 3.5.5 because CQ read its default.map byte by byte; Day provided a hotfix and the startup time went down from 1 hour to about 15 minutes. The fix was to read the default.map in 2k blocks (or 4k, it doesn’t matter as long as it isn’t byte-wise); this issue could only be noticed by either reading the source code or running truss.
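Why the buffer size matters so much becomes clear when you count the read calls. The Python lines below are only a model: an in-memory stream stands in for the file, every `read()` stands for one read system call on an unbuffered file, and the 8 KB size is an arbitrary example, not the real size of default.map.

```python
import io

def count_reads(data, chunk_size):
    """Count the read() calls which return data when reading in chunks."""
    stream = io.BytesIO(data)
    reads = 0
    while stream.read(chunk_size):  # an empty result means end of file
        reads += 1
    return reads

data = b"x" * 8192             # stand-in for a small 8 KB file
print(count_reads(data, 1))     # 8192 calls when reading byte by byte
print(count_reads(data, 2048))  # 4 calls when reading in 2k blocks
```

Every one of these calls is a round trip into the kernel, so the byte-wise variant pays that overhead about 2000 times more often for the same amount of data.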