Recently I was asked how monitoring for a CQ installation should be set up. The question was accompanied by a very short description of how the monitoring was supposed to look. It listed points like “monitoring the publishing by requesting pages”, “checking that the Java process is running” and “checking the free disk space”. Obviously they had just set up some new servers for this environment and realized that they needed to monitor some parameters.
As a first step I advised separating the topics “application monitoring” and “system monitoring”. One might wonder why I suggest making a strong division between these topics, so here is the background.
Standardization is one of the key topics in IT: everything that is standardized can be reused, can be exchanged for a compatible product, and ultimately lowers costs. So IT operations teams tend to standardize as much as they can, because standardization allows automation, which is the intermediate step toward lower costs.
Basic system monitoring is such a thing. Every computer has components that can be monitored this way: disk health, CPU temperature, status of the power supply units, internal temperature; but also CPU utilization, free disk space, network connectivity, or whether the system starts to swap. And many more. These are basic metrics which can be measured and monitored in a consistent and automatic way.
For these points it doesn’t matter whether the system runs a data warehouse application, a mail server or CQ. They are all the same, and the reaction is very comparable if one of these monitored things fails: if a disk is dead, someone needs to replace it (with not-so-old servers you can do this online and without service interruption). The procedure may differ from computer to computer, but the basic action is always the same: when the monitoring shows that a disk failed, look up the type of the failed disk, get a new one, go to the computer and replace it according to the guidelines of the computer manufacturer. That’s it. You can handle some thousand servers that way with only a few people.
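To illustrate how application-agnostic such a check is, here is a minimal sketch in Python: the same snippet would work on a mail server, a database host or a CQ instance. The 10% threshold is just an illustrative assumption; a real setup would run something like this from a monitoring system rather than by hand.

```python
# Sketch of a standardized, application-agnostic system check:
# free disk space below a threshold triggers an alert.
import shutil

def check_free_disk(path="/", min_free_ratio=0.10):
    """Return (ok, message) depending on free disk space at `path`.

    The 10% default threshold is an assumption for illustration.
    """
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    ok = free_ratio >= min_free_ratio
    return ok, f"{path}: {free_ratio:.0%} free (threshold {min_free_ratio:.0%})"

ok, msg = check_free_disk("/")
print(("OK" if ok else "ALERT"), msg)
```

The point is not the snippet itself but that it needs zero knowledge of the application running on the box, which is exactly why this kind of monitoring can be standardized and automated across thousands of servers.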
Running applications isn’t standardized that way. One application requires a Windows server, others run, for historical reasons, only on big iron. One vendor offers performance guarantees only for Linux systems, while other vendors don’t care about the platform as long as they have a WebSphere Application Server as a base. Some applications are designed to run centralized, others can be clustered. Some have good logging and messages you can use for diagnosis; others don’t, and the causes of errors must be tracked down with system tools like truss or strace.
So applications are highly non-standardized and often need special skills and knowledge to operate them. Automation is a very hard job here, and management support is needed to steer every part of the organisation in the right direction.
(As a side note: in my former life, before I joined Day, I worked in a large IT operations organisation where every application was somehow non-standard; some less so, but some also completely outside any standard. IT tried its best to create some kind of standardization, but the businesses often didn’t care that much about it; also, developers didn’t know much about IT operations, so “but it works on my machine!!” and “just open the firewall so these 2 components can talk to each other” were often heard in early project stages.)
These applications also need completely different kinds of monitoring. The implementation of SAP monitoring looks different than the application monitoring for a web application. The actions to take in case of problems probably differ even more; and when it comes to investigating errors, the web application administrator cannot do anything on the SAP system, and vice versa.
So it’s advisable to separate the monitoring into two parts: the basic system monitoring and the application monitoring.
The system monitoring part can be done by one team for all servers. The application monitoring is too complex and too diverse, and the required actions so often depend on special know-how, that it must be adjusted to the needs of each application and its administrators.
As a final conclusion: every time a computer system is set up, put it into the basic system monitoring, so failing disks can get replaced.
And when the application administrator deploys the application on it, the application-specific monitoring is installed at that point, simply because the needs and skills it takes to react to monitored issues are very different.
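For CQ, that application-specific part could start with something as simple as the point from the original question: request a page from the publish instance and check that it contains expected content. The sketch below uses only the Python standard library; the URL and marker string are hypothetical and would be specific to each deployment.

```python
# Application-level check: fetch a page and verify it contains a marker string.
# urlopen raises HTTPError/URLError (both OSError subclasses) for error
# responses and connection problems, so all failure paths return False.
from urllib.request import urlopen

def check_page(url, expected_text, timeout=10):
    """Return True if the page loads within `timeout` seconds
    and contains `expected_text`, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        return expected_text in body
    except OSError:
        return False

# Hypothetical example against a publish instance on port 4503:
# check_page("http://publish01:4503/content/site/en.html", "Welcome")
```

Unlike the disk-space check, only the CQ administrators know which page and which marker text prove that publishing actually works, which is exactly why this check belongs to the application monitoring, not the standardized system monitoring.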