Category Archives: healthcheck

When is AEM fully started?

Or in other words: How can I know that the instance is fully working?

A common task when you work with automation is reliably detecting when the AEM instance is up and running. Maybe you then reconfigure the loadbalancer to send requests to this instance, or you just start doing some other work.

The most naive approach is to request an AEM page and act on the HTTP status code. If the status is “200”, you consider the system up and running; if you get any other code, it’s not. Sounds easy, is easy, but it is not really accurate: there are times during startup when the system returns a status code 200 but a blank page. Unfortunate.
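To make the weakness concrete, here is a minimal Python sketch of the status-code check and a slightly stricter variant. The function names are hypothetical, and "blank page" is interpreted here as an empty or whitespace-only response body, which is an assumption; fetching the page is left out on purpose.

```python
def naive_is_up(status_code: int) -> bool:
    """The naive check: only the HTTP status code counts."""
    return status_code == 200


def is_up_with_body_check(status_code: int, body: str) -> bool:
    """Stricter variant: also reject a 200 response with a blank body.

    Assumption: "blank" means empty or whitespace-only.
    """
    return status_code == 200 and bool(body.strip())
```

Even the stricter variant only tells you that some page rendered, not that the services behind it are working.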

So the next approach: check if all bundles are active. Request /system/console/bundles.json, parse it, and look for a statement like this:

"status":"Bundle information: 447 bundles in total - all 447 bundles active."

Nice try, but it does not work. All bundles being up does not guarantee that all the services are up as well.
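For illustration only, the bundle check could be sketched in Python like this. It assumes the response body is a JSON object with a "status" field worded exactly like the statement quoted above; the function name is hypothetical and fetching the URL (including authentication) is left out.

```python
import json
import re

# Matches only the "all bundles active" wording quoted above; any other
# wording of the status line is treated as "not all active".
ALL_ACTIVE = re.compile(
    r"Bundle information: (\d+) bundles in total - all (\d+) bundles active\."
)


def all_bundles_active(bundles_json: str) -> bool:
    """Return True if the "status" field reports every bundle as active."""
    try:
        status = json.loads(bundles_json).get("status", "")
    except (ValueError, AttributeError):
        # Not valid JSON, or not a JSON object.
        return False
    if not isinstance(status, str):
        return False
    match = ALL_ACTIVE.search(status)
    return bool(match) and match.group(1) == match.group(2)
```

As said, a True result here only covers bundles, not the services they provide.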

The third approach is more complicated and requires coding, but delivers good results: build a healthcheck which depends on a lot of other services (the ones you consider important). If this healthcheck is present and reports OK, all services it depends on are active as well (the simple default semantics of the @Reference annotation guarantee that). This does not necessarily mean that the startup is finished, just that the services you consider relevant are up.

And finally there is a fourth approach, which has been built specifically for this case: the startup listeners. It’s a service interface you can implement, and you get notified when the system is up. That’s it. The API does not give any guarantee that a system which is up now is still up 5 minutes later. I am also not 100% sure about the semantics of this approach if a service fails to start, or if a service decides to stop (or starts throwing exceptions).

The healthcheck is my personal favorite. It can be used not only to give you information about a single event (“the system is up”), but it can take many more factors into account to decide if the system is up, and these factors can be checked constantly. When a service is no longer available, the healthcheck goes to ERROR (“red”), and when it’s available again, the healthcheck reports OK again. The approach is more powerful, provides better extensibility and is quite easy to understand. So I choose a healthcheck every time I need to know about the health state of AEM.

CQ5 healthcheck: backport for CQ 5.4

I learned that there are quite a number of projects out there which are (still?) bound to CQ 5.4 and cannot move forward to a newer version right now. For these I created a backport of the healthcheck version 1.0, which works reasonably well on my personal instance of CQ 5.4. You can find the code on GitHub in the release-1.0-cq54 branch, but I don’t provide a compiled binary version.

The main changes compared to the master branch:

  • I backported the 1.0 branch, not master. Currently the changes aren’t that hard, so you can maintain a branch “master-cq54” on your own.
  • Adjusted pom files; no code changes required due to this, but only
  • The PropertiesUtil class is not there, but you can replace 1:1 with the OsgiUtil class available in CQ 5.4
  • Use “sling:OsgiConfig” nodes instead of nt:file nodes with the extension “.config” (the latter is available on CQ 5.5 and later)
  • CQ 5.4 does not support sub-folders within the config folder, so you need to put all config nodes directly into it.

And of course the biggest limitation:

  • For replication there is no ootb JMX support, therefore I dropped the respective config nodes.
  • If you want to contribute support for this feature, you’re welcome 🙂

So have fun with it.

CQ5 healthcheck: how to use

The recent announcement of my healthcheck project caused some buzz, most of it related to how it can be used. So I want to show you how you can leverage the framework yourself.

The statuspage

First, the package already contains a status page (reachable via <host>/content/statuspage.html), which looks like this:

[Screenshot: CQ5 healthcheck]

The first relevant piece of information is the “Overall Status”: it can be “OK”, “WARN” or “CRITICAL”.

This information is computed from all the individual checks listed in the details table, according to this ruleset:

  • If at least 1 check returns CRITICAL, the overall status is “CRITICAL”.
  • If at least 1 check returns WARN and no check returns CRITICAL, the overall status is “WARN”.
  • If all checks return OK, the overall status is “OK”.
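The ruleset above can be sketched in a few lines of Python. This is not the framework's actual code, just an illustration of the rules; the function name is hypothetical, and treating an empty list of checks as “OK” is an assumption.

```python
from typing import Iterable


def overall_status(check_statuses: Iterable[str]) -> str:
    """Aggregate individual check results into one overall status."""
    statuses = list(check_statuses)
    if "CRITICAL" in statuses:
        # A single CRITICAL check dominates everything else.
        return "CRITICAL"
    if "WARN" in statuses:
        # No CRITICAL, but at least one WARN.
        return "WARN"
    # Assumption: no checks at all also counts as OK.
    return "OK"
```

For example, `overall_status(["OK", "WARN", "OK"])` yields “WARN”, and adding one “CRITICAL” to that list turns the result into “CRITICAL”.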

The overall status is easily parseable on the statuspage by a monitoring system.

The individual checks are listed by name, status and an optional message. This list should be used to determine which check failed and caused the overall status to deviate from OK.

The status in detail:

  • OK: obvious, isn’t it?
  • WARN: the checked status is not “OK”, but also not “CRITICAL”. The system is still usable, but you need to observe it more closely or perform some actions so the situation won’t get worse.
  • CRITICAL: The system should not be used and user experience will be impacted. Actions required.

Managing the loadbalancers

Any loadbalancer in front of CQ5 instances should also be aware of the status of the instance. But loadbalancers probe much more often (about every 30 seconds), and they don’t have many capabilities to parse complex data. For this use case there is the “/bin/loadbalancer” servlet, which returns only “OK” with a status code 200, or “WARN” with a status code 500. WARN covers both WARN and CRITICAL; in both cases it is assumed that the loadbalancer should not send requests to that instance.
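The mapping described above, and the decision a probe would make from it, might look like this in Python. Both function names are hypothetical and this is a sketch of the described behaviour, not the servlet's source.

```python
from typing import Tuple


def loadbalancer_response(overall: str) -> Tuple[int, str]:
    """Map the overall status to the (status code, body) pair described above."""
    if overall == "OK":
        return 200, "OK"
    # WARN and CRITICAL are both reported as WARN with a 500.
    return 500, "WARN"


def keep_in_rotation(status_code: int, body: str) -> bool:
    """Probe side: only send traffic when the servlet answers 200 / OK."""
    return status_code == 200 and body.strip() == "OK"
```

A probe that can only look at the status code still works, since any non-OK state already comes back as 500.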

That’s it for now. If you have feedback, mail me or just create an issue on GitHub.

CQ5 healthcheck version 1.0

My colleague Alex already disclosed it in early December, but back then it was still not ready. In the meantime I think that it’s actually time to release it.

So here it is: the CQ5 health check, a small and easy to understand framework to monitor the status of your CQ5 instance. Its main features are:

  • Usable out of the box
  • All MBeans can be monitored just by configuration
  • Extendable by simple OSGi services
  • Features an extended status page as well as a machine interface for automatic monitoring

And the best part: the source is freely available on GitHub, a package ready for installation is available on Package Share, and the installation is very easy:

  1. Download and install the package from Package Share
  2. Go to http://<your-instance>:4502/content/statuspage.html
  3. Enjoy

So feel free to download it, install it, fork it, extend it. The code is licensed under the Apache License, so you don’t have to disclose your extensions and modifications at all. But I’d love to get contributions back 🙂

Currently the most useful information is stored in the README file, but I hope that I can move it over to the project wiki. This is just the announcement; I plan to add some posts to this blog on how you can write your own health checks (which isn’t hard, by the way).

Enjoy your new toy. I’d love to get feedback from you, either here on the blog, via Twitter (@joerghoh) or in geek-style via pull requests.

And many thanks to Alex Saar and Markus Haack for their support and contributions.