
Application changes and incompatible features

In the lifecycle of every larger application there are many occasions where features evolve. In most cases the new version of a feature is compatible with earlier versions, but there are always cases where incompatible changes are required.

As an example, let me take the scenario the guys at KD WebConsult described in their latest blog entry (which, I have to admit, inspired me to write this post). There is a component which introduces changes to the underlying repository structure, and the question is how to cope with that change during deployments.

I think this is a classic case of incompatible changes, which always result in additional effort; that’s the reason why nobody likes incompatible changes and everyone tries to avoid them as much as possible. While in a pure AEM environment you should have full control over all changes, it gets harder if you have systems you depend on or systems depending on you. Then you run into the topic of interface lifecycle management. Making changes then gets hard and sometimes nearly impossible, and you end up supporting multiple versions of an interface at the same time. Duplicating code is then a way to cope with it. (The technical debt is then not only on your side, but also on the side of others not updating, or not able to update, their use of the interfaces.)

So to come back to the KD Webconsult example: I think the cleanest solution is to build your component in a way that it supports both the old and the new repository structure (their option 2). And once you are sure that the old structure is not used anymore, you can safely remove the logic for it.
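To illustrate that option, such a component can try the new structure first and fall back to the old one. The following is only a sketch with made-up node and property names (a child node settings for the new structure, a teaserTitle property for the old one), using the Sling resource API:

 import org.apache.sling.api.resource.Resource;

 public class TeaserModel {

  // hypothetical names: the new structure stores the title on a child node,
  // the old structure stored it directly on the component node
  private static final String NEW_CHILD_NODE = "settings";
  private static final String OLD_PROPERTY = "teaserTitle";

  public String getTitle(Resource component) {
   // new repository structure
   Resource settings = component.getChild(NEW_CHILD_NODE);
   if (settings != null) {
    String title = settings.getValueMap().get("title", String.class);
    if (title != null) {
     return title;
    }
   }
   // fall back to the old repository structure
   return component.getValueMap().get(OLD_PROPERTY, String.class);
  }
 }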

The idea that you can always avoid such situations and keep your code clean is wrong. As soon as you are dealing with a non-trivial setup (and AEM in an enterprise context is per se a distributed application, which comes along with other enterprise requirements such as high availability), you have to make compromises. And carrying technical debt for a release or two is not necessarily a bad compromise if it allows you to stick with your standard processes (i.e. not changing deployment processes).

JCR Observation in clustered AEM instances

Clustering AEM changed a bit with the introduction of Oak. With the enforcement of the MVCC model in Oak I also advise revisiting some patterns you might have got used to, because code which worked without apparent problems in AEM 5.x might cause problems now.

One thing I would check are the JCR observation listeners. Using JCR observation is a common way to react to changes in the repository, and it has been a common pattern since CQ 5.0. So what’s the problem with it? The problem is that many JCR observation handlers are not written with clustering in mind.

Take the example that you need to react to changes in the repository and in turn modify something else. The usual approach is a service like this (omitting a lot of the boilerplate …):

public class MyListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  ObservationManager om = session.getWorkspace().getObservationManager();
  om.addEventListener(this,
   Event.NODE_ADDED,           // event types
   "/content/mysite",          // absolute path
   true,                       // isDeep
   null,                       // uuids
   new String[]{"cq:Page"},    // node types
   true);                      // noLocal
  ...
 }

 @Override
 public void onEvent(EventIterator events) {
  // iterate through the events and change something in the repository.
 }

}

This works very well in any non-clustered environment, because there is only a single event handler performing these changes. In clustered environments the situation is different: now there is such an event handler active on each cluster node, and each one wants to perform the repository changes.
In that case you’ll see a lot of Oak exceptions (on all cluster nodes) which indicate that nodes have been modified externally (outside of the current session) and that a merge was not possible. This is because the changes happen in (quasi-)parallel but are not visible to the currently open sessions, thus causing these exceptions.

The only solution to this problem is to either execute the EventListener on a single node only, or to have every event handled by exactly one event handler instead of by all of them.
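The first variant can be implemented with the help of the Sling Discovery API, which elects a leader among the cluster nodes. This is only a sketch (OSGi registration is omitted; it assumes the class is registered both as an EventListener and as a TopologyEventListener):

 import javax.jcr.observation.EventIterator;
 import javax.jcr.observation.EventListener;
 import org.apache.sling.discovery.TopologyEvent;
 import org.apache.sling.discovery.TopologyEventListener;

 public class LeaderOnlyListener implements EventListener, TopologyEventListener {

  private volatile boolean isLeader = false;

  // called by Sling Discovery whenever the cluster topology changes
  public void handleTopologyEvent(TopologyEvent event) {
   if (event.getType() == TopologyEvent.Type.TOPOLOGY_CHANGING) {
    // the topology is in flux, better not act as leader right now
    isLeader = false;
   } else if (event.getType() == TopologyEvent.Type.TOPOLOGY_INIT
     || event.getType() == TopologyEvent.Type.TOPOLOGY_CHANGED) {
    isLeader = event.getNewView().getLocalInstance().isLeader();
   }
  }

  public void onEvent(EventIterator events) {
   if (!isLeader) {
    return; // another cluster node is responsible for these events
   }
   // iterate through the events and change something in the repository.
  }
 }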

The second variant, handling every observation event by exactly one handler, is the more elegant and scalable solution. The idea is to handle on every cluster node only the changes which happen on this cluster node („local events“). While the JCR API doesn’t have any notion of a cluster and the observation API does not tell you whether an event is local or not, the Jackrabbit implementation (which Oak uses here) supports this through the JackrabbitObservationManager. As you can see in the following snippet, only the registration of the observation handler changes, but not the handler itself.

public class MyScalableListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  JackrabbitEventFilter ef = new JackrabbitEventFilter()
   .setAbsPath("/content/mysite")
   .setNodeTypes(new String[]{"cq:Page"})
   .setEventTypes(Event.NODE_ADDED)
   .setIsDeep(true)
   .setNoExternal(true);
  JackrabbitObservationManager om = (JackrabbitObservationManager) session.getWorkspace().getObservationManager();
  om.addEventListener (this, ef);
  ...
 }

 @Override
 public void onEvent(EventIterator events) {
  // iterate through the events and change something in the repository.
 }
}

Through this Jackrabbit API extension you can register your EventListener to handle only local changes and ignore any external ones generated on other cluster nodes (using the setNoExternal(true) call). This is a scalable solution because the events are handled where they are generated, and no cluster node becomes a bottleneck because of it.

So whenever you write an observation handler, and especially when you use a cluster, you should review your code and make sure that you avoid concurrent access to the same resource. Of course there are many ways to end up with concurrent access even without clustering, but when you actually use clustering, the JCR observation handlers are the easiest piece of code to check and fix.

Why Adobe is a great company to work for

About 6 years ago, while I was working in technical consulting for a Swiss company called Day Software AG, I learned that the US software company Adobe had acquired Day because of its only product, CQ5. It was a bit surprising to me, because I had not perceived Adobe as a player in the enterprise web world. I assumed Adobe sold boxed software for creatives, and that its only „relevant“ contribution to the web was Flash. I was also quite hesitant, because I had loved Day for its openness: as a consultant I had full access to the source code and to all the developers, and that approach was not something I would associate with a 25-year-old „traditional“ American software company. But I gave it a try.

It was a good decision. I was assigned to the German consulting group, which at the time was a small team (about 15-20 people) of very talented people, none of whom had any CQ5 experience. So 3 former Day consultants formed the core of this CQ5 consulting team, which has now (in 2016) grown to over 25 consultants doing projects with AEM (the evolution of CQ5), plus many others who implement projects with other products of the Adobe Marketing Cloud.

Despite my initial fear, Adobe adopted the open development culture of Day, and from my perception with huge success. Internal discussions between individual engineers, consultants, product managers, support, documentation teams and others take place on a permanent basis on internally open mailing lists, offering a huge pool of information on many topics regarding AEM and beyond. So there is no trace of the siloed organization I was afraid of.

Adobe offered me a lot of chances to grow, excel, take ownership of topics and demonstrate leadership. Based on the initiative of some consultants, we were able to fund and staff the creation and delivery of an internal AEM architect course, which is in high demand (I deliver it about twice a year). Many people are recognized as experts in specific areas because they are quite active on internal mailing lists, offering help in their areas and sharing experiences from their projects. The same happens in the even more informal Slack channels, where participants discuss problems and possible solutions, or simply give advice on how to get the right desk for the remote (home) office. There are a lot of ways to demonstrate your skills inside (and outside!) of Adobe.

And of course there are the projects I worked on in the last years: small or large, long-term or just a few days to solve a problem, a “fire-fighting” engagement to resolve urgent stability issues, or an architecture review. I did them all. For me, learning the different facets of CQ5 and AEM under many different requirements was one of the key drivers that made me stay with Adobe.
Last but not least, I work in a great team. Although the consulting business and the area we cover sometimes make it hard to meet each other, we try hard to keep in touch at least virtually. Informal in-person meetings of the regional sub-teams take place quite often, and the weekly team call (although optional) is well attended.

This year I was able to do something absolutely great: within regular PTO, my family and I did a 5-week tour through the United States, visiting some major landmarks and lots of friends. A huge „thank you“ to my managers for making this possible!

So Adobe is a great place to work. Come and join us!

Automation done right — a reliable process

(the subtitle could be: Why I don’t call simple curl scripts „automation“).

Jakub Wadalowski mentioned in his talk at CircuitConf 2016 that „Jörg doesn’t like curl“. That is not totally true: I love curl as a command-line HTTP client (if only to avoid netcat).

But I don’t like the way people use curl in scripts and then call these scripts „automation“; I have already blogged about this topic.

Good AEM automation requires much more than just package installation. True automation has to work in a way that you don’t need to perform any manual intervention, supervision or control; it has to work in a reliable and repeatable fashion.
But we also know that there will always be errors and problems which prevent a successful result. A package installation can succeed, but the bundles might not come up. In such cases the system has to actively report the problem back, and this reporting has to be so accurate that you can rely on it: if nothing is reported, everything is fine, and there are no manual checks anymore!
This really requires a process you can trust. You need to trust your automated process so much that you can start it and then go home.
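To make the idea of actively reporting problems back a bit more concrete, here is a minimal sketch of a check a deployment pipeline could run after a package installation. The instance URL and credentials are placeholders, and it relies on the summary array of the Felix web console’s bundles.json (which, as far as I know, contains the counts [total, active, fragments, resolved, installed]):

 import java.net.URI;
 import java.net.http.HttpClient;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.nio.charset.StandardCharsets;
 import java.util.Base64;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 public class BundleStatusCheck {

  // hypothetical instance URL and credentials, to be adjusted per environment
  private static final String BUNDLES_URL = "http://localhost:4502/system/console/bundles.json";
  private static final String CREDENTIALS = "admin:admin";

  public static void main(String[] args) throws Exception {
   HttpRequest request = HttpRequest.newBuilder(URI.create(BUNDLES_URL))
    .header("Authorization", "Basic "
     + Base64.getEncoder().encodeToString(CREDENTIALS.getBytes(StandardCharsets.UTF_8)))
    .GET()
    .build();
   HttpResponse<String> response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString());

   if (response.statusCode() != 200) {
    System.err.println("Bundle status endpoint returned HTTP " + response.statusCode());
    System.exit(1); // non-zero exit code: the pipeline must fail, nobody has to check manually
   }

   // the "s" array of bundles.json contains [total, active, fragments, resolved, installed]
   Matcher m = Pattern.compile("\"s\":\\[(\\d+),(\\d+),(\\d+),(\\d+),(\\d+)\\]")
    .matcher(response.body());
   if (!m.find()) {
    System.err.println("Could not parse bundle summary from " + BUNDLES_URL);
    System.exit(1);
   }
   int resolved = Integer.parseInt(m.group(4));
   int installed = Integer.parseInt(m.group(5));
   if (resolved > 0 || installed > 0) {
    System.err.println("Not all bundles are active: " + m.group());
    System.exit(1);
   }
   System.out.println("All bundles active.");
  }
 }

Whether this is written in Java, Groovy or a shell script does not matter; the point is that the check runs automatically and fails loudly enough that the process stops.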

So to come back to Jakub’s „Jörg doesn’t like curl“ statement: when you just have the curl calls without error checking (curl can report the HTTP status code, for example via -w "%{http_code}", but hardly any of these scripts ever evaluates it) and without proper error handling and reporting, it will never be reliable. It might save you a few clicks, but in the end you still have to perform a lot of manual steps. And getting away from these manual steps with shell scripting alone requires a good amount of scripting.

And to be honest: I’ve rarely seen „good shell scripts“.

Storage sizing and SAN

Part of every project I’ve done in the last years was the task of creating a hardware sizing; many times it was part of the project setup and an important piece which got fed into the hardware provisioning process.

(Hardware sizing is often required during presales stages, where it is mostly used to compare different vendors in terms of investment into hardware. In such situations the process to create a sizing is very similar, but the results are often communicated quite differently …)

Depending on the organization, this hardware provisioning process can be quite lengthy, and once the parameters have been set they are very hard to change. That is especially true for large organisations which run an on-premise deployment in their own datacenter, because it means starting a purchasing process followed by provisioning, installation and so on. Especially in the FSI (financial services) sector it is not uncommon to have 6 months between the signed order of the budget manager and the handover of an installed server. And then you get exactly what you asked for. So everyone tries to make sure that the input data is as good as possible and that all variables have been considered. This reminds me a lot of the waterfall model in software development, and it comes with the same problems.

Even if your initial sizing was 100% accurate (which it never is), 6 months are a long time in which important project parameters can change. So there is a not-so-small chance that by the time the servers are handed over to you, you already know, based on the information you have now, that the hardware is not sufficient anymore.
But changes are hardly possible, because for cost efficiency you did not order the hardware offering the most flexibility for future growth, but a model with some constraints in terms of extensibility. And now, 6 months after project start and just 4 months ahead of the go-live date, you cannot restart the purchasing process anymore!

Or a bit worse: you have already been live for 6 months and now start to run short of disk space, because your growth is much higher than anticipated. But the drive bays of your servers are already full, and you have already installed the largest disks available.

For the topic of disk space the solution can be quite easy: don’t use local disks! Even if local SSDs are very performant in terms of IOPS, try to avoid the discussion and go for a SAN (storage area network), which should already be available in an enterprise datacenter. (Of course you can also choose any other technology which strongly decouples the storage from the server and performs well.) For AEM and TarMK a good SAN is sufficient to deliver decent performance (even if a local SSD improves it further).

I know that this statement can be controversial, as there are cases where IOPS are more important than the flexibility to easily add storage. Then your only option is to take the SSDs and make sure that you can still add more of them later.

The benefit of a SAN is that you separate the servers from their storage and can upgrade or extend them independently of each other. Adding more hard drives to a SAN is always possible, so there is hardly a limit on available disk space per server. Attaching more disk space to a server then becomes a matter of minutes and can be done incrementally. This also allows you to attach disk space on demand instead of attaching the full amount at provisioning time (and only consuming it fully 2 years later).
And if servers and storage are split up, it is also much easier to replace a server with a model with more capacity (RAM- or CPU-wise), because you don’t need to move all data, but rather just connect the storage to a different server.

So using a SAN does not free you from delivering a good sizing, but it can soften the impact of an insufficient one (mostly caused by insufficient data), which is often the case at project kickoff.

Managing repository growth

On almost every project there comes a time when (all of a sudden) the disk space assigned to the AEM instances fills up; in most (!) cases such a situation is detected early enough to react, which often means just adding more disk capacity.

But after that the questions arise: Why is our repository so large? Why does it consume more than the initially estimated disk space? What content exactly is causing this problem? What went wrong so we actually got into that situation?

From my point of view there are 3 different angles on this situation, each of which can be part of the answer:

  • Disk space is not managed well, meaning that unnecessary data is consuming a lot of space outside of the repository.
  • The maintenance jobs are not executed.
  • The estimation for content and content growth was not realistic.

And in most cases you need to check all these 3 views to answer the question “Why are we running out of space?”

Disk space is not managed well
This is an operations problem in the first place: non-repository data is consuming disk space which was planned for the repository. Often seen:

  • Heapdumps and log files are not rotated or removed.
  • Manually created backup files were copied once to a folder next to the original, but have become useless in the meantime because they are completely out of date.
  • The regular backup process creates temporary files which are not cleaned up, or the backup process itself consumes temporary disk space, which lets the disk consumption spike.

This can only be handled by working carefully and purging old data in time.

The maintenance jobs are not executed
Maintenance jobs are an essential part of the ongoing work to remove unnecessary fat from your AEM instance, be it on the content level or on the repository level. This includes:

  • workflow purge
  • audit log purge
  • repository compaction (if you use TarMK)
  • datastore GC (if you use a datastore)

You should always keep an eye on these; the Maintenance Dashboard is a great help here. But do not rely on it blindly.

The estimation for content and content growth was not realistic
That’s a common problem: you have to give an initial hardware sizing, which also includes the amount of disk space used by the repository. You do your best to include all relevant parameters and add some buffer on top. But that is an estimation at the beginning of the project, when you don’t know all the requirements and their impact on disk consumption in detail. Still, that’s the number you gave, and changing it afterwards is always problematic.

Or AEM is used differently than initially anticipated and the assumptions your initial hardware sizing was based on are no longer true. Or you simply forgot to include the versioning of assets in your calculation. Or…
There are a lot of cases where, in retrospect, the initial sizing of the required disk space was just incorrect. In that case you have only one option: redo the calculation right now! Take your new insights and create a new hardware sizing. Then implement it and add more hardware.

And the only way I see to avoid such situations is: do not make estimations for the next 3 years! Be agile and review your current numbers every 3 months; during this review you can also determine the needs for the coming months and plan accordingly. Of course this assumes that you are flexible in terms of disk sizing, so for any non-trivial setup the use of a SAN as storage technology (plus a volume manager) is my preferred choice!

Of course this does not free you from cleaning up the systems and running all required maintenance jobs, but it makes the review of the used disk space a regular task, so you should see deviations from your anticipated plan much earlier.

What is writing to my Oak repository?

If you were ever curious what’s happening on the Oak repository in AEM (and I can tell you: a lot!), you can log all the repository write operations.

Just create a logger for „org.apache.jackrabbit.oak.jcr.operations.writes“ at log level TRACE and there you go.
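One way to set this up is a Sling logging configuration; the following is just a sketch (the configuration name and the target log file are examples), using the standard properties of the Sling Commons Log factory configuration (PID org.apache.sling.commons.log.LogManager.factory.config):

 org.apache.sling.commons.log.level="trace"
 org.apache.sling.commons.log.file="logs/oak-writes.log"
 org.apache.sling.commons.log.names=["org.apache.jackrabbit.oak.jcr.operations.writes"]

With this logger in place you will get entries like these: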


21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:userid]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:path]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:type]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:category]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/userId]
21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/path]
21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/type]
21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:properties]
21.05.2016 23:38:34.356 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] save

The thread with the name [Thread-8145] writes to the /var/audit path, so it’s quite likely related to the audit logger. It uses the session [session-43731]; the session name is random per session, but it is very useful information:

  • You can identify a single session and what’s happening within it. This is especially useful to determine what this specific session is writing and how often it saves.
  • If you have the session name, and it is a long-running session, you can look up this session in the JMX MBean console; in Oak 1.0 and Oak 1.2 a stack trace is stored when the session is opened; in Oak 1.4 this has been removed for performance reasons, but you can get it back when you set the System property ‚oak.sessionStats.initStackTraceThreshold‘ to ‚0‘ (zero).

So if you need to understand what’s happening on your repository and what might be causing repository write activity, this is an easy way to go. The only drawback: such logging eats up a lot of disk space, especially if you run it for an extended period of time.

It’s also possible to do this for reads, but at least on Oak 1.0 it doesn’t log the paths, only the operations, so it’s less useful there. And it produces a lot of data: having it enabled for 2 minutes and a single page load on my local instance produced 10 megabytes of logs.