Sling healthchecks – what to monitor

The ability to know what’s going on inside an application is a major asset whenever you need to operate an application. As IT operation you don’t need (or want) to know details. But you need to know if the application works “ok” by any definition. Typically IT operations deploys alongside with the application a monitoring setting which allows to make the statement, if the application is ok or not.

In the AEM world such statements are typically made via Sling healthchecks. As a developer you can write assertions against the correct and expected behaviour of your application and expose the result. While this technical aspect is understood quite easily, the more important question is: “What should I monitor”?

In the last years and projects we typically used healthchecks for both deployment validation (if the application goes “ok” after a deployment, we continued with the next instance) and loadbalancer checks (if the application is “ok”, traffic is routed to this instance). This results in these requirements:

  • The status must not fluctuate, but rather be stable. This normally means, that temporary problems (which might affect only a single request) must no influence the result of the healthcheck. But if the error rate exceeds a certain threshold it should be reported (via healthchecks)
  • The healthcheck execution time shouldn’t be excessive. I would not recommend to perform JCR queries or other repository operations in a healthcheck, but rather consume data which is already there, to keep the execution time low.
  • A lot of the infrastructure of an AEM instance can be implicitly monitored. If you expose your healthcheck results via a page (/status.html) and this page results with a status “200 OK”, then you know, that the JVM process is running, the repo is up and that the sling script resolution and execution is working properly.
  • When you want your loadbalancer to determine the health status of your application by the results of Sling healthchecks, you must be quite sure, that every single healthcheck involved works in the right way. And that an “ERROR” in the healthcheck really means, that the application itself is not working and cannot respond to requests properly. In other words: If this healthchecks goes to ERROR on all publishs at the same time, your application is no longer reachable.

Ok, but what functionality should you monitor via healthchecks? Which part of your application should you monitor? Some recommendations.

  1. Monitor pre-requisites. If your application constantly needs connectivity to a backend system, and is not working properly without it, implement a healthcheck for it. If the configuration for accessing this system is not there or even the initial connection start fails, let your healthcheck report “ERROR”, because then the system  cannot be used.
  2. Monitor individual errors: For example, if such a backend connection throws an error, report it as warn (remember, you depend on it!). And implement an error threshold for errors, and if this threshold is reached, report it as ERROR.
  3. You should implement a healthcheck for every stateful OSGI service, which knows about “success” or “failed” operations.
  4. Try Avoid the case, that a single error is reported via multiple healthchecks. On the other side try to be as specific as possible when reporting. So instead of “more than 2% of all requests were answered with an HTTP statuscode 5xx” via healthcheck1 you should report “connection refused from LDAP server” in healthcheck 2. In many cases fundamental problems will trigger a lot of different symptoms (and therefor cause many healthchecks to fail) and it is very hard to change this behaviour. In that case you need to do document explicitly how to react in such responses and how to find the relevant healthcheck quickly.

Regarding the reporting itself you can report every problem/exception/failure or work with the threshold.

  • Report every single problem if the operations runs rarely. If you have a scheduled task with a daily interval, the healthcheck should report immediately if it fails. Also report “ok” if it works well again. The time between runs should give enough time to react.
  • If your operation runs very often (e.g. as part of every request) implement a threshold and report only a warning/error if this threshold is currently exceeded. But try to avoid constantly switching between “ok” and “not ok”.

I hope that this article gives you some idea how you should work with healthchecks. I see them as a great tool and very easy to implement. They are ideal to perform one-time validations (e.g. as smoketests after a deployment) and also for continous observation of an instance (by feeding healthcheck data to loadbalancers). Of course they can only observe the things you defined up front and do not replace testing. But they really speedup processes.

AEM and docker – a question of state

The containerization of the IT world continues. What started with virtualization in the early 2000s has reached with Docker a state, where it’s again a hype topic.

Therefor it’s natural that people also started to play with AEM in docker (, and many more).

Of course I was challenged with the requirement to run AEM in docker too. Customers and partners asking how to run AEM in docker. If I can provide dockerfiles etc.  I am hestitating to do it, because for me docker and AEM are not a really good fit (right now with AEM 6.3 in 2017).

Some background first: Docker containers should be stateless. Only if the application within the container does not hold any persistent state, you can shut it down (which means deleting all the files created by the application in the container itself), start it up, replace it by a different container holding a new version of the application etc. The whole idea is to make the persistent state somebody else’s problem (typically a database). Deployments should be as easy as starting new docker instances (from a pre-tested and validated docker images) and shutting down the old ones. Not working and testing in production anymore.

So, how does that collide with AEM? AEM is not only an application, but the application is closely tied with a repository, which holds state. Typically the application is stored within the repository, next to the “user data” (= content). This means, that you cannot just replace an AEM instance inside docker by a new instance without loosing this content (or resetting it to a state, which is shipped with the docker image). Loosing content is of course not acceptable.

So the typical docker rollout approach of new application versions (bringing new instances live based on a new docker image and shutting down the old ones) does not work with AEM; the content sitting in the repository is the problem.

People then came up with the idea, that the repository can stored outside of the docker image, so isn’t lost on restart/replacement of the image. Docker calls this “host directory as data volume” (

Storing the repo as data volume on the host filesystem

That idea sounds neat and of course it works. But then we have a different problem. When you start a new docker image and you mount this data volume containing the repository state, your AEM still runs the “old” version of your application. Starting the repository from a different docker image doesn’t bring any benefit then.

Docker image version 2 still starts application version 1.0

When you want to update your AEM application inside the repository, you would still need to perform an installation of your application into a running repository. Working in a production environment. And that’s not the idea why you want to use docker.
With docker we just wanted to start the new images and to stop the old ones.

Therefor I do not recommend to use docker with AEM; there is rarely a value for it, but it makes the setup more complicated without any real benefit.

The only exceptions I would accept are really short-lived instances, where hosting the repository inside the docker system isn’t a problem and purging the repo on shutdown is even a feature. Typically these are short-lived development instances (e.g. triggered by Continous integration pipeline, where you automatically create dedicated docker instances for feature branches). But that’s it.

And as a sidenote: This does not only affect TarMK-based AEM instances. If you have mongo-based instances, the application is also stored within the (Mongo-) repo. Just running AEM in a new docker image doesn’t update the application magically.

To repeat myself: This considers the current state. I know that the AEM engineering is perfectly aware of this fact, and I am sure that they try to adress it. Let’s wait for the future 🙂

What’s new in Sling with AEM 6.3

AEM 6.3 is out. Besides the various product enhancements and new features it also includes updated versions of many open source libraries. So it’s your chance to have a closer at the version numbers and find out what changed. And again, quite a lot.

For your convenience I collected the version information of the sling bundles which are part of the initial released version (GA). Please note that with Servicepacks and Cumulative Hotfixes new bundle versions might be introduced.
If a version number is marked red, it has changed compared to the previous AEM versions. Bundles which are not present in the given version are marked with a dash (-).

For the versions 5.6 to 6.1 please see this older posting of mine.

Symbolic name of the bundle AEM 6.1 (GA) AEM 6.2 (GA) AEM 6.3 (GA) 2.1.4 2.1.6 (Changelog) 2.1.8 (Changelog) 2.9.0 2.11.0 (Changelog) 2.16.2 (Changelog) 0.9.0.R988585 0.9.0.R988585 0.9.0.R988585 1.3.6 1.3.14 (Changelog) 1.3.24 (Changelog) 0.0.1.R1582230 1.0.6 (Changelog) 2.2.0 2.2.0 2.2.0 1.1.0 1.2.0 1.2.0 1.3.2 1.3.2 1.3.8 (Changelog) 2.2.0 2.2.0 2.3.0 (Changelog) 1.0.2 1.0.2 1.0.0 1.0.2 (Changelog) 1.0.4 (Changelog) 1.0.0 1.0.0 1.0.0 2.0.10 2.0.16 (Changelog) 2.0.20 (Changelog) 4.0.2 4.0.6 (Changelog) 5.0.0 (Changelog) 1.0.0 1.0.4 1.0.6 (Changelog) 1.0.6 1.0.0 (Changelog) 1.2.0 (Changelog) 2.1.8 2.1.8 2.1.10 (Changelog) 2.2.2 2.4.0 (Changelog) 2.4.0 2.4.6 2.4.14 (Changelog) 2.5.2 (Changelog) 3.2.0 3.2.6 (Changelog) 3.2.6 1.0.0 1.0.0 1.0.2 (Changelog) 1.0.2 1.0.2 1.0.4 (Changelog) 1.1.2 (Changelog) 1.1.6 (Changelog) 1.0.10 (Changelog) 1.0.18 (Changelog) 1.1.0 1.2.6 (Changelog) 1.2.16 (Changelog) 1.0.0 1.0.0 1.0.0 0.1.0 0.3.0 0.3.0 0.1.1.r1678168 0.1.15.r1733486 0.2.6 2.4.2 2.4.6 (Changelog) 2.6.6 (Changelog) 3.5.5.R1667281 4.0.0 (Changelog) 4.2.0 (Changelog) 1.0.0 1.0.4 (Changelog) 1.1.0 (Changelog) 0.2.2 1.1.4 1.1.6 (Changelog) 1.2.0 (Changelog) 1.0.0 1.0.2 (Changelog) 1.2.0 (Changelog) 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.0.0 1.0.0 1.2.0 1.2.2 (Changelog) 1.2.4 (Changelog) 1.1.2 1.1.2 1.1.2 2.4.0 2.4.4 (Changelog) 2.5.6 (Changelog) 1.0.0 1.0.0 1.0.2 (Changelog) 3.6.4 3.6.8 (Changelog) 3.8.6 (Changelog) 1.1.2 1.1.2 1.1.2 1.0.0 1.1.0 1.1.0 1.1.0 3.1.16 3.1.18 (Changelog) 3.1.22 (Changelog) 0.1.0 0.1.0 0.1.0 2.2.0 2.3.0 (Changelog) 2.4.0 (Changelog) 2.2.2 2.3.2 (Changelog) 3.0.0 (Changelog) 2.1.0 2.1.10 2.1.10 2.1.10 1.2.2 1.3.4 (Changelog) 1.3.8 (Changelog) 2.0.0 2.0.0 2.0.0 1.0.2 1.0.2 1.0.2 1.1.2 2.5.0 2.7.4.B001 (Changelog) 2.9.2 (Changelog) 1.0.2 1.0.2 1.0.2 2.2.2 2.3.4 (Changelog) 2.3.8 (Changelog) 1.0.2 1.0.2 1.0.2 1.2.0 1.2.2 (Changelog) 1.2.2 1.1.0 1.2.2 (Changelog) 1.3.2 (Changelog) 1.1.0 1.2.2 (Changelog) 1.3.9.r1784960 (Changelog) 1.0.6 1.8.0 1.1.0 1.0.4 1.0.4 1.0.6 (Changelog) 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.2.9.R1675563-B002 1.3.0 (Changelog) 1.3.0 1.2.4 1.4.8 (Changelog) 1.5.22 (Changelog) 1.0.4 1.1.2 (Changelog) 1.2.1.R1777332 (Changelog) 2.1.6 2.1.8 (Changelog) 2.1.12 (Changelog) 2.0.28 2.0.36 (Changelog) 2.0.44 (Changelog) 2.0.12 2.0.14 (Changelog) 2.1.2 (Changelog) 2.0.16 2.0.28 2.0.30 2.1.6 2.1.8 2.2.6 2.2.4 2.2.4 2.2.6 2.0.6 2.0.6 2.0.6 1.0.2 1.0.18 1.0.32 1.0.8 1.0.8 1.0.4 1.0.10 1.0.18 1.0.0 1.0.6 1.0.10 1.0.18 1.1.2 1.2.0 1.2.2 1.2.4 1.0.0.Revision1200172 1.0.0.Revision1200172 1.0.0.Revision1200172 2.1.10 2.1.14 2.1.22 2.3.6 2.3.8 2.3.15.r1780096 2.3.6 2.4.2 2.4.10 1.3.6 1.3.8 1.3.8 0.0.1.Rev1526908 0.0.1.Rev1526908 0.0.1.Rev1764482 0.0.1.Rev1387008 0.0.1.Rev1387008 0.0.1.Rev1758544 1.0.2 1.0.2 1.1.0 0.0.2 1.0.2

AEM 6.3: set admin password on initial startup (Update)

With AEM 6.3 you have the chance to setup the admin password already on the initial start. By default the quickstart asks you for the password if you start it directly. That’s a great feature and shortens quite some deployment instructions, but it doesn’t work always.

For example, if you first unpack your AEM instance and then use the start script, you’ll never get asked for the new admin password. The same if you work in an application server setup. And if you do automatic installations, you don’t want to get asked at all.

But I found, that even in these scenarios you can set the admin password as part of the initial installation. There are 2 different ways:

  1. Set the system property “admin.password” to your desired password; and it will be used (for example add “-Dadmin.password=mypassword” to the JVM parameters).
  2. Or set the system property “admin.password.file” and pass as value the path to a file; when this file is accessible by the AEM instance and the contains the line “admin.password=myAdminPassword“, this value will be used as admin password.

Please note, that this only works on the initial startup. On all subsequent startups these system properties are ignored; and you should probably remove them or at least purge the file in case of (2).

Update: Ruben Reusser mentioned, that the Osgi Webconsole Admin password is not changed (which is used in case the repository is not running). So you still need to work on that.

What I check on code reviews

At several occassions I did code reviews on AEM projects  in the last months. I don’t do that exercise quite often, so I don’t have a real routine or checklists what to look at. But in the past I learned some lessons about how to write code for AEMs, so I hope I check relevant pieces. Feedback appreciated.

So let’s start with my top 10 items I look for:

  1. The use of “System.out.println()“, “System.err.println()” and “e.printStackTrace()” statements. Logging is cheap and easy, but obviously not easy enough, I still find these statements. They should be replaced, because these statements do not provide relevant metadata like time and class. And to be honest, people tend to look only at the error.log file, but not on stdout.log.
  2. Servlets bound to fixed paths. In most cases it should be replaced by binding either to a selector or to a resourcetype. The Sling documentation explains it quite well.
  3. The creation of dedicated sessions/ResourceResolvers (either admin sessions ot service user sessions) and if these sessions are properly closed. Although this should be common knowledge to AEM developers, there’s still code out there which doesn’t close resource resolvers or logs out sessions, causing memory leaks.
  4. The existence of long-running sessions. You shouldn’t write services, which open a session on activate and close them on deactive (see this blog post for the explanation). The only exception to this rule: JCR observation handlers.
  5. adaptTo() calls without proper null checks. adaptTo() is allowed to return null. There are cases where it can be neglected (in reality you’ll never have all occurrences of it checked), but in most cases it has to be checked to avoid NullPointerException.
  6. Log hygiene part 1: The excessive use of LOG.error(), when a is sufficient. Or even worse: LOG.error/info() instead of LOG.debug().
  7. Log hygiene part 2: Log messages without meaningful description. A log message has to contain relevant information like “with what parameter does this exception happen? At what node? Which request?”. Consider that some parts are always and implicitly logged (e.g. the thread name, which contains in case of a request the requested path), but you need to provide every other information which can be useful to understand the problem when found in the logs.
  8. The mixed usage of JCR and Sling API. Choose either one, but then stick to it. You should not have a method, which takes both a Session and a ResourceResolver as parameter (or other object from these domains).
  9. Performing string operations on paths. I already blogged about it.
  10. JCR queries. Are they properly used or can they get replaced by a short tree traversal?

So when you get through all of these quite generic items, your code is already quite well. And if you don’t have a specific problem which I should investigate, I will likely stop here. Because then you already proved, that you understand AEM and how to operate it quite well, so I wouldn’t expect major issues anymore.

I am sure that the personal background influences a lot the intuitive approach on code review. Therefor I am interested in your checklists and how it differs from mine. Just leave a comment, drop me a mail or tweet me 🙂

Let’s try to compile a list which we all can use to improve our code.

AEM coding best practice: No String operations on paths!

I recently needed to review project code in order to validate if it makes problems when upgrading from AEM 5.6 to AEM 6.x; so my focus wasn’t on the code in the first place, but on some other usual suspects (JCR queries etc). But having seen a few dozen classes I found a pattern, which I then found all over the code: the excessive use of String operations. With a special focus on string operations on repository paths.

For example something like this:

String[] segments = resource.getPath.split("/");
String settingsPath = "/" + StringUtils.join(segments,"/",0,2) + "/settings/jcr:content";
Resource settings = resourceResolver.get(settingsPath);
ValueMap vm = settings.adaptTo(ValueMap.class);
String language = vm.get("language");

(to read settings, which are stored in a dedicated page per site).

Typically it’s a weird mixture of String methods, the use of StringUtils classes plus some custom helpers, which do things with ResourceResolvers, Sessions and paths. Spread all over the codebase. Ah, and it lacks a lot of error checking (what if the settings page doesn’t contain the “language” property? adaptTo() can return “null”).

Sadly that problem not limited to this specific project code, I found it in many other projects as well.

Such a code structure is a clear sign for the lack of abstraction and guidance. There are no concepts available, which eliminate the need to operate on strings, but the developer is left with the only abstraction he has: The repository and CRXDE Lite’s view on it. He logs into the repository, looks at the structure and then decides how to mangle known pieces of information to get hold of the things he needs to access. If there’s noone which reviews the code and rejects such pieces, the pattern emerges and you can find it all over the codebase. Even if developers create utility classes for (normally every developer creates one on its own), it’s a bad approach, because these concepts are not designed (“just read the language from the settings page!”), but the design “just happens“; there is no focus on it, therefor quality is not enforced and error handling typically quite poor.

Another huge drawback of this approach: It’s very hard to change the content structure, because at many levels assumptions about the content structure are used, which are often not directly visible. For the example the constants “0” and “2” in the above code snippets determine the path prefix, but you cannot search for such definitions (even if they are defined as constant values).

If the proper abstraction would be provided, the above mentioned code could look like this:

String language = "en";
Settings settings = resource.adaptTo(Settings.class);
If (settings != null) {
  language = settings.getLanguage();

This approach is totally agnostic of paths and the fact that settings are stored inside a page. It hides this behind a well-defined abstraction. And if you ever need to change the content structure you know where you need to change your code: Only in the AdapterFactory.

So whenever you see code which uses String operations on paths: Rethink it. Twice. And then come up with some abstraction layer and refactor it. It will have a huge impact on the maintainability of your code base.

When is AEM fully started?

Or in other words: How can I know that the instance is fully working?

A common task when you work with automation is a realiable detection when the AEM instance is up and running. Maybe you reconfigure the loadbalancer to send requests to this instance. Or you just start doing some other work.

The most naive approach is to request a AEM page and act on the HTTP status code. If the status is “200”, you consider the system up and running. If you get any other code, it’s not. Sounds easy, is easy. But not really accurate. Because there are times during startup, when the system returns a status code 200, but a blank page. Unfortunate.

So next approach: Check if all bundles are active. Check /system/console/bundles.json and parse it. Look for a statement like this:

status":"Bundle information: 447 bundles in total - all 447 bundles active.

Nice try, but does not work. All bundles being up does not guarantee, that all the services are up as well.

The third approach is more compplicated and requires coding, but delivers good results: Build a healthcheck which depends on a lot of other services (the ones you consider important). If this healthcheck is present and delivers ok, it means, that all services it depends on are active as well (the simple default semantic of the @Reference annotation guarantees that). This does not necessarily mean, that the startup is finished, but just that the services you considered relevant are up.

And finally there is a fourth approach, which has been built specifically for this case: The startup listeners. It’s a service interface you can implement, and you get notified when the system is up. That’s it. The API does not give any guarantee that if the system is up, that 5 minutes later it is still up. I am not 100% sure so the semantics of this approach if a service fails to start. Or if a service decides to stop (or starts throwing exceptions).

The healthcheck is my personal favorite. It can be used not  only to give you information about a single event (“the system is up”), but it can take much more factors into account to decide if the system is up. And these factors can be constantly checked. When a service is no longer available, the healthcheck goes to ERROR (“red”), and it’s available again, the healthcheck reports OK again. The approach is more powerfull, provides better extensibility and is quite easy to understand. So I choose a healthcheck everytime when I need to know about the health state of AEM.