Archive for November, 2009

Application monitoring vs System monitoring

November 20, 2009

Recently I was asked how monitoring for CQ should be set up. The question came with a very short description of what the monitoring was supposed to look like. It listed points like “monitoring the publishing by requesting pages”, “check if the java process is running” and “checking the free disk space”. Apparently they had just set up some new servers for this environment and thought that they needed to monitor some parameters.

As a first step I advised separating the topics “application monitoring” and “system monitoring”. One might wonder why I suggest such a strong division between these topics, so here is the background.

Standardization is one of the key topics in IT: everything that is standardized can be reused, can be replaced by a compatible product, and ultimately lowers cost. So IT operation teams tend to standardize as much as they can, because standardization allows automation, which is an intermediate step towards lower costs.

Basic system monitoring is such a thing. Every computer has components which can be monitored that way: disk health, CPU temperature, status of the power supply units, internal temperature. But also CPU utilization, free disk space, network connectivity, or whether the system starts to swap. And many more. These are basic metrics which can be measured and monitored in a consistent and automatic way.
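To illustrate how uniform such a check is, here is a minimal Python sketch of a free disk space check. The function names and the 10% threshold are my own assumptions for illustration; a real monitoring system would ship its own check for this.

```python
import shutil


def free_disk_fraction(path="/"):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero; treat them as full.
        return 0.0
    return usage.free / usage.total


def check_disk(path="/", min_free=0.10):
    """Return (ok, message). The default threshold of 10% is an arbitrary example."""
    fraction = free_disk_fraction(path)
    ok = fraction >= min_free
    state = "OK" if ok else "WARNING"
    return ok, f"{state}: {fraction:.0%} free on {path}"


if __name__ == "__main__":
    ok, message = check_disk("/")
    print(message)
```

The same check works unchanged on any server, no matter which application runs on it, which is exactly what makes it a good candidate for automation.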

For these points it doesn’t matter whether the system runs a data warehouse application, a mail server or CQ. They are all the same, and the reaction is comparable if one of these monitored things fails: If a disk is dead, one needs to replace it (with not-so-old servers you can do this online and without service interruption). The procedure may differ from computer to computer, but the basic action is always the same: When the monitoring shows that a disk failed, look up the type of the failed disk, get a new one, go to the computer and replace it according to the guidelines of the computer manufacturer. That’s it. You can handle some thousand servers that way with only a few people.

Running applications isn’t standardized that way. One application requires a Windows Server, others run, for historical reasons, only on big iron. One vendor offers performance guarantees only for Linux systems, and other vendors don’t care about the platform as long as they have a WebSphere Application Server as a base. Some applications are designed to run centralized, others can be clustered. Some have good logging and messages you can use for diagnosis, others don’t, and the causes of errors must be tracked down with system tools like truss or strace.
So applications are highly non-standardized and often need special skills and knowledge to operate. Automation is a very hard job here, and it needs management support to move every part of the organisation in the right direction.

(As a side note: In my former life, before I joined Day, I worked in a large IT operations organisation where every application was somehow non-standard; some only slightly, others completely out of line. IT tried its best to create some kind of standardization, but the business units often didn’t care that much about it. Developers also didn’t know much about IT operations, so “but it works on my machine!!” and “Just open the firewall, so these 2 components can talk to each other” were often heard in early project stages.)

These applications also need completely different kinds of monitoring. The implementation of SAP monitoring looks different from the monitoring of a web application. The actions to take in case of problems probably differ even more; and when it comes to investigating errors, the web application administrator cannot do anything on the SAP system. And vice versa.

So it’s advisable to separate the monitoring into 2 parts: The basic system monitoring and the application monitoring.

The system monitoring part can be done by one team for all servers. Application monitoring is too complex and too varied, and the required actions so often need special know-how, that it must be adjustable to the needs of each application and its administrators.

As a final conclusion: Every time a computer system is set up, put it into the basic system monitoring, so that failing disks get replaced.
When the application administrator deploys the application on it, the application-specific monitoring is installed as well.
The two stay separate simply because the needs and skills required to react to the monitored issues are very different.

Basic performance tuning: Caching

November 4, 2009

Many CQ installations I’ve seen start with the default configuration of CQ. This is in fact a good decision, because the default configuration can handle small and medium-sized installations very well. In addition, you don’t have to maintain a bunch of configuration files and settings; and finally, most CQ hotfixes (which are delivered without the QA) are only tested with default installations.

So when you start your project with a pristine CQ installation, the performance of both publishing and authoring instances is usually very good: the UI is responsive, and page load times are in the two-digit millisecond range. Great. Excellent.

When your site grows and the content authors start their work, you need to run your first performance and stress tests using numbers provided by the requirements (“the site must be able to handle 10000 concurrent requests per second with a maximum response time of 2 seconds”). You can either meet such requirements by throwing hardware at the problem (“we must use 6 publishers each on a 4-core machine”) or try to optimize your site. Okay, let’s try optimization first.

Caching is the first thing that comes to mind. You can cache on several layers of the application, be it at the application level (caches built into the application, like the output cache of CQ 3 and 4), in the dispatcher cache (as described here in this blog), or on the user’s system (using the browser cache). Each cache layer should reduce the number of requests reaching the layers behind it, so that in the end only those requests get through which cannot be answered from a cache and must be processed by CQ. Our goal is to move the files into the cache nearest to the end user; loading them from there is faster than loading them from a location 20 000 kilometers away.

(A system engineer may also be interested in that solution, because it will offload data traffic from the internet connection. Leaves more capacity for other interesting things …)

If you start from scratch with performance tuning, going for the low-hanging fruit is the way to go. So you start an iterative process, which consists of the following steps:

  1. Identify requests which can be handled by a caching layer which is placed nearer to the enduser.
  2. Identify actions which allow these requests to be cached closer to the user.
  3. Perform these actions.
  4. Measure the results using appropriate tools.
  5. Start over from (1).

(For a broader view of performance tuning, see David Nuescheler’s post on the Day developer site.)

As an example I will go through this cycle on the authoring system. I start with a random look at the request.log, which may look like this:

09/Oct/2009:09:08:03 +0200 [8] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:06 +0200 [8] <- 200 text/html; charset=utf-8 3016ms
09/Oct/2009:09:08:12 +0200 [9] -> GET / HTTP/1.1
09/Oct/2009:09:08:12 +0200 [9] <- 302 - 29ms
09/Oct/2009:09:08:12 +0200 [10] -> GET /index.html HTTP/1.1
09/Oct/2009:09:08:12 +0200 [10] <- 302 - 2ms
09/Oct/2009:09:08:12 +0200 [11] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:13 +0200 [11] <- 200 text/html; charset=utf-8 826ms
09/Oct/2009:09:08:13 +0200 [12] -> GET /libs/wcm/welcome/resources/welcome.css HTTP/1.1
09/Oct/2009:09:08:13 +0200 [12] <- 200 text/css 4ms
09/Oct/2009:09:08:13 +0200 [13] -> GET /libs/wcm/welcome/resources/ico_siteadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [14] -> GET /libs/wcm/welcome/resources/ico_misc.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] -> GET /libs/wcm/welcome/resources/ico_useradmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] <- 200 image/png 8ms
09/Oct/2009:09:08:13 +0200 [16] -> GET /libs/wcm/welcome/resources/ico_damadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [16] <- 200 image/png 5ms
09/Oct/2009:09:08:13 +0200 [13] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [14] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [17] -> GET /libs/wcm/welcome/resources/welcome_bground.gif HTTP/1.1
09/Oct/2009:09:08:13 +0200 [17] <- 200 image/gif 3ms

Ok, it looks like some of these requests need not be handled by CQ: those for the PNG files and the CSS files. These files rarely change (maybe on a deployment or when a hotfix is installed). For the usual daily work of a content author they can be assumed to be static, but of course we must provide a way for the authors to fetch a new version when one of them is updated. Ok, that was step 1: We want to cache the PNG and the CSS files which are placed below /libs.
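Looking at a log by eye only works for short excerpts. The following Python sketch shows one way to do step 1 on a larger request.log: it joins each request line with its response line via the number in brackets and counts the successful requests below /libs per MIME type. The regular expressions are derived only from the excerpt above, and the function names and the small sample are my own assumptions, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt above; other log formats need other patterns.
REQUEST = re.compile(r"\[(\d+)\] -> (\S+) (\S+) ")
RESPONSE = re.compile(r"\[(\d+)\] <- (\d{3}) (.+) (\d+)ms$")


def pair_requests(lines):
    """Join each '->' request line with its '<-' response line via the request id."""
    pending = {}
    for line in lines:
        line = line.strip()
        match = REQUEST.search(line)
        if match:
            request_id, method, path = match.groups()
            pending[request_id] = (method, path)
            continue
        match = RESPONSE.search(line)
        if match and match.group(1) in pending:
            request_id, status, content_type, millis = match.groups()
            method, path = pending.pop(request_id)
            # Drop parameters such as "; charset=utf-8" from the content type.
            mime = content_type.split(";")[0].strip()
            yield method, path, int(status), mime, int(millis)


def count_by_mime(lines, prefix="/libs/"):
    """Count successful (status 200) requests below prefix, grouped by MIME type."""
    return Counter(
        mime
        for _method, path, status, mime, _millis in pair_requests(lines)
        if status == 200 and path.startswith(prefix)
    )


SAMPLE = [
    "09/Oct/2009:09:08:12 +0200 [9] -> GET / HTTP/1.1",
    "09/Oct/2009:09:08:12 +0200 [9] <- 302 - 29ms",
    "09/Oct/2009:09:08:13 +0200 [12] -> GET /libs/wcm/welcome/resources/welcome.css HTTP/1.1",
    "09/Oct/2009:09:08:13 +0200 [12] <- 200 text/css 4ms",
    "09/Oct/2009:09:08:13 +0200 [13] -> GET /libs/wcm/welcome/resources/ico_siteadmin.png HTTP/1.1",
    "09/Oct/2009:09:08:13 +0200 [14] -> GET /libs/wcm/welcome/resources/ico_misc.png HTTP/1.1",
    "09/Oct/2009:09:08:13 +0200 [13] <- 200 image/png 17ms",
    "09/Oct/2009:09:08:13 +0200 [14] <- 200 image/png 17ms",
]

if __name__ == "__main__":
    for mime, count in count_by_mime(SAMPLE).most_common():
        print(mime, count)
```

MIME types with many hits and short response times are the typical candidates for a cache closer to the user.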

Step 2: How can we cache these files? We don’t want to cache them within CQ (that wouldn’t bring any improvement), which leaves the dispatcher cache and the browser cache. In this case I recommend caching them in the browser cache, for 2 reasons:

  • These files are requested more than once during a typical authoring session, so it makes sense to cache directly in the browser cache.
  • The latency of the browser cache is far lower than the latency of any load over the network.

As an additional restriction which speaks against the dispatcher:

  • There are no flushing agents for authoring mode, so we cannot easily use the dispatcher there. When tuning an authoring instance, the dispatcher cache is therefore not an option.

And to make any changes to these files on the server visible to the user, we can use the expiration feature of HTTP. It allows us to specify a time-to-live, which tells any interested party how long we consider the file to be up to date. When this time has passed, every party which cached the file should treat its copy as stale and fetch it again.
This isn’t the perfect solution, because a browser will refetch the file from time to time although its cached copy is still valid and up to date.
But it is still an improvement if the browser fetches these files once an hour instead of twice a minute (whenever a page load occurs).
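To make the time-to-live idea more concrete, here is a small Python sketch of how a client could derive the lifetime of a cached file from the response headers. It prefers the max-age directive of the Cache-Control header and falls back to the difference between Expires and Date. The function name is hypothetical, the header values are made-up examples, and the lookup assumes the header names are spelled exactly as shown; real browsers apply more rules than this.

```python
from email.utils import parsedate_to_datetime


def freshness_seconds(headers):
    """Return how many seconds a response may be served from cache, or None."""
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            # max-age takes precedence over the Expires header.
            return int(value)
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return max(0, int((expires - date).total_seconds()))
    # No explicit expiration information was sent.
    return None


if __name__ == "__main__":
    example = {
        "Date": "Fri, 09 Oct 2009 07:08:13 GMT",
        "Expires": "Fri, 09 Oct 2009 08:08:13 GMT",
    }
    print(freshness_seconds(example))
```

A result of None means the server gave no explicit lifetime, which is the situation we want to get rid of for the static files.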

Our prediction is that the browser of an authoring user will send far fewer requests for these files. This improves the rendering performance of the page (the files are fetched from the fast browser cache instead of from the server), and it also decreases the load on CQ, because it no longer needs to handle that many requests. Good for all parties.

Step 3: We implement this feature in the Apache web server which we have placed in front of our CQ authoring system, by adding the following statements:

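<LocationMatch "^/libs/">
# The Expires* directives belong to mod_expires, which must be loaded and switched on
ExpiresActive On
ExpiresByType image/png "access plus 1 hour"
ExpiresByType text/css "access plus 1 hour"
</LocationMatch>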

Instead of relying on file extensions, these rules specify the expiration by MIME type. The files are considered up to date for an hour, so the browser will reload them every hour. This value should also be fine for the rare case that one of these files changes. And if all else fails, the authoring users can clear their browser cache.

Step 4: We measure the effect of our changes using 2 different strategies. First we observe the request.log again and check whether these requests still appear. If the server is already heavily loaded, we can additionally check for decreasing load and improved response times for the remaining requests. As a second option we take a simple use case of an authoring user and run it with Firefox’s Firebug extension enabled. This plugin visualizes how and when the parts of a page are loaded and displays the response times quite exactly. You should now see that the number of files requested over the network has decreased and that the page and all its embedded objects load faster than before.
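The first strategy can be turned into numbers with a few lines of Python. This sketch counts the request lines for PNG and CSS files below /libs in two sets of log lines, one taken before and one after the change, and reports the reduction. The function names and the made-up log lines are assumptions for illustration, and the two samples should cover comparable authoring sessions for the result to mean anything.

```python
import re

# Matches the path in request lines such as "[12] -> GET /libs/... HTTP/1.1".
REQUEST_PATH = re.compile(r"\] -> \S+ (\S+)")


def count_static_requests(lines, prefix="/libs/", suffixes=(".png", ".css")):
    """Count request lines whose path is below prefix and ends with one of suffixes."""
    count = 0
    for line in lines:
        match = REQUEST_PATH.search(line)
        if match:
            # Ignore a query string, if there is one.
            path = match.group(1).split("?")[0]
            if path.startswith(prefix) and path.endswith(suffixes):
                count += 1
    return count


def reduction_percent(before, after):
    """Return by how many percent the count dropped; 0.0 if there was nothing before."""
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Made-up lines in the format of the request.log excerpt.
    before = [
        "09/Oct/2009:09:08:13 +0200 [12] -> GET /libs/a.css HTTP/1.1",
        "09/Oct/2009:09:08:13 +0200 [13] -> GET /libs/b.png HTTP/1.1",
        "09/Oct/2009:09:08:43 +0200 [20] -> GET /libs/a.css HTTP/1.1",
        "09/Oct/2009:09:08:43 +0200 [21] -> GET /libs/b.png HTTP/1.1",
    ]
    after = [
        "09/Oct/2009:10:08:13 +0200 [12] -> GET /libs/a.css HTTP/1.1",
    ]
    dropped = reduction_percent(count_static_requests(before), count_static_requests(after))
    print(f"static requests reduced by {dropped:.0f}%")
```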

So with a quick and easy-to-perform action you have decreased the page load times. When I added expiration headers to a number of static images, JavaScript and CSS files on a publishing instance, the number of requests which went over the wire dropped to 50%, and the page load times decreased as well, so that even during a stress test the site still performed well. Of course, dynamic parts must be handled by their respective systems, but if we can offload requests from CQ, we should do so.

So as a conclusion: Some very basic changes to the system (a few adjustments to the Apache config) may increase the speed of your site (publishing and authoring) dramatically. The changes described here are not invasive and are highly adjustable to the specific needs and requirements of your application.