Category Archives: pattern

AEM coding pattern: Run-mode specific code

It is very common to have code which is supposed to run not on all environments, but only on some. For example you might have code which is supposed to import data only on the authoring instance.

Then code often looks like this:

if (isAuthoring()) {
  // import data
}

boolean isAuthoring() {
  return slingSettingsService.getRunModes().contains("author");
}

This code does what it’s supposed to do. But problems arise when you want to run this code not only on authors, but (for whatever reason) also on publish. Or when you don’t want to run the code on UAT authors.
In such cases this code does not work anymore, because it’s not flexible enough (the runmode is hardcoded); any change requires a code change and a deployment.

A better way is to reformulate the requirement a bit: “The data import will only run if there is the ‘importActive’ flag set to true”.

If you design this flag “importActive” as an OSGI config, and combine it with runmode-dependent configuration, then you can achieve the same behaviour as above, but be much more flexible. You can even disable the import (even if only for a certain time).

The code could then look like this

@Property(boolValue = true)
private static final String IMPORT_ACTIVE_PROP = "importActive";
private boolean importActive;

@Activate
protected void activate(ComponentContext ctx) {
  // PropertiesUtil from org.apache.sling.commons.osgi; falls back to true if the property is not set
  importActive = PropertiesUtil.toBoolean(ctx.getProperties().get(IMPORT_ACTIVE_PROP), true);
}

if (importActive) {
  // import data
}

Now you have translated the requirement “imports should only happen on authoring” into a configuration decision, and it’s no longer hardcoded. And that’s the reason why I will be picky about it in code reviews.

Do not write to the repository in event handling

Repository writes are always resource-intensive operations, and they always come with a cost. First of all, the write operation holds a number of locks, which limits the overall concurrency of write operations. Secondly, the write itself can take some time, especially if the I/O is loaded or you are running in a cluster with MongoDB as backend; in this case it’s the latency of the network connection plus the latency of MongoDB itself. And third, every write operation causes async index updates, triggers JCR observation and Sling resource events etc. That’s the reason why you shouldn’t take write operations too lightly.

But don’t be scared to write to the repo for performance reasons, no way! Instead try to avoid unnecessary writes. Either batch them and collapse multiple write operations into a single transaction, if the business case allows it. Or avoid the repository writes altogether, especially if the information is not required to be persisted.

Recently I came across a very stressful pattern of write operations: write amplification. A Sling Resource Event listener was listening for certain write events to the repository; and whenever one of these was received (which happened quite often), a Sling Job was created to handle this event. And the job implementation just made another small change to the repository (write!) and finished.

In that case a single write operation resulted in:

  • A write operation to persist the Sling Job
  • A write operation performed by the Job implementation
  • A write operation to remove the Sling Job

Each of these “regular” write operations caused 3 subsequent writes to the repository, which is a great way to kill your write performance completely. Luckily none of these 3 additional write operations caused the event listener to create a new Sling Job again … That would have caused the same effect as “out of office” notifications in the early days of Microsoft Exchange (which didn’t detect these and would have sent an “out-of-office” reply to the “out-of-office” sender): A very effective way of DOSing yourself!

a flood of writes and the remains of performance

But even though that was not the case, it resulted in a very loaded environment, reducing the perceived performance a lot; thread dumps indicated massive lock contention on write operations. When these 3 additional writes were optimized (effectively removed, as it was possible to collect the information in memory and batch-write it after 30 seconds), the situation improved a lot, and the lock contention was gone.

The learning you should take away from this scenario: Avoid writing to the repository in JCR Observation or Sling event listeners! Collect the data and write outside of these handlers in a separate thread.
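
The “collect and write later” idea can be sketched in plain Java. This is only a minimal illustration, not the code of the project mentioned above: the class name BatchingCollector and the batchWriter callback are made up for this example, and the callback stands in for whatever code opens a session and persists the whole batch with a single save.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical helper: event handlers only record what changed,
// a separate thread persists everything in one go.
final class BatchingCollector {

    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    private final Consumer<List<String>> batchWriter; // assumed: performs ONE repository write per batch

    BatchingCollector(Consumer<List<String>> batchWriter) {
        this.batchWriter = batchWriter;
    }

    // Called from the event handler: cheap, in-memory only, no repository access.
    void collect(String path) {
        pending.add(path);
    }

    // Called periodically from a separate thread; drains the queue and
    // hands all collected entries to the writer as a single batch.
    int flush() {
        List<String> batch = new ArrayList<>();
        String path;
        while ((path = pending.poll()) != null) {
            batch.add(path);
        }
        if (!batch.isEmpty()) {
            batchWriter.accept(batch);
        }
        return batch.size();
    }
}
```

The flush() method could be triggered by a ScheduledExecutorService, for example with scheduleWithFixedDelay(collector::flush, 30, 30, TimeUnit.SECONDS). The price of this design is that collected data which has not been flushed yet is lost when the instance goes down, so it only fits cases where that is acceptable.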

PS: An interesting side effect of Sling event listeners taking a long time is that these handlers are blacklisted after they took more than 5 seconds to process an event (e.g. because they were blocked on writing). Then they are not fired again (until you restart AEM), unless you explicitly whitelist them or turn off this feature completely.

Design pattern: Configuration of OSGI services

When you are an AEM backend developer, the pattern is very familiar: Whenever you need to provide configuration data to the service, you collect this data in the activate() method (by good tradition that’s the name of the method annotated with the “@Activate” annotation). I use this pattern often and normally it does not cause any problems.

But recently in my current project we ran into an issue which caused headaches. We needed to provide an API key which is supposed to change every now and then, and which therefore is not configured as an OSGI property, but instead stored inside the repository, so it can be authored.

We deployed the code, entered the API key, and … Guess what? It was not working at all. The API key was read in the activate method, but at that time the key was not yet present. And the only way to make it work was to restart the service/bundle/instance. And besides the initial provisioning it would have required a restart every time the key was changed.

That’s not a nice situation when you try to automate your deployment (or try not to break your automated deployment). We had to rewrite our logic so that the API key was read periodically (every minute) from the repository. Of course the optimal way would have been to use JCR observation or a Sling Event Handler to detect any changes on the API key node/resource immediately …

So whenever you have such “dynamic” configuration data, you should design your code so that it can cope with situations in which this configuration is not there (yet) or changes. The last thing you want to do is to restart your instance(s) because such a configuration change has happened.

Let’s formulate this as a pattern: Do not read from the repository in the “activate” method of a service! The content you read might change during runtime, and you need to react to it.
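
A minimal sketch of such a periodically refreshed value, in plain Java: the class RefreshingValue and its loader are hypothetical names for this illustration. The loader stands in for the code which reads the key from the repository and returns an empty Optional as long as nothing is stored there.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder for a value which lives in the repository
// and may be missing at startup or change at runtime.
final class RefreshingValue {

    private final Supplier<Optional<String>> loader; // assumed: reads the current value from the repository
    private final AtomicReference<String> current = new AtomicReference<>();

    RefreshingValue(Supplier<Optional<String>> loader) {
        this.loader = loader;
    }

    // Invoke periodically (e.g. every minute) or from a change listener.
    // If the loader finds nothing, the last known value is kept.
    void refresh() {
        loader.get().ifPresent(current::set);
    }

    // Callers must handle the "not configured yet" case explicitly.
    Optional<String> get() {
        return Optional.ofNullable(current.get());
    }
}
```

The service then calls get() whenever it needs the key, instead of storing it in a field during activation. Keeping the last known value when the loader returns nothing is a design choice of this sketch; if a removed key should disable the feature, refresh() would have to clear the reference instead.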

AEM coding best practice: No String operations on paths!

I recently needed to review project code in order to check whether it would cause problems when upgrading from AEM 5.6 to AEM 6.x; so my focus wasn’t on the code in the first place, but on some other usual suspects (JCR queries etc). But having seen a few dozen classes I found a pattern, which I then found all over the code: the excessive use of String operations, with a special focus on string operations on repository paths.

For example something like this:

String[] segments = resource.getPath().split("/");
String settingsPath = "/" + StringUtils.join(segments,"/",0,2) + "/settings/jcr:content";
Resource settings = resourceResolver.getResource(settingsPath);
ValueMap vm = settings.adaptTo(ValueMap.class);
String language = vm.get("language", String.class);

(to read settings, which are stored in a dedicated page per site).

Typically it’s a weird mixture of String methods, the use of StringUtils classes plus some custom helpers, which do things with ResourceResolvers, Sessions and paths. Spread all over the codebase. Ah, and it lacks a lot of error checking (what if the settings page doesn’t contain the “language” property? adaptTo() can return “null”).

Sadly that problem is not limited to this specific project code; I found it in many other projects as well.

Such a code structure is a clear sign of a lack of abstraction and guidance. There are no concepts available which eliminate the need to operate on strings, so the developer is left with the only abstraction he has: the repository and CRXDE Lite’s view on it. He logs into the repository, looks at the structure and then decides how to mangle known pieces of information to get hold of the things he needs to access. If there’s no one who reviews the code and rejects such pieces, the pattern emerges and you can find it all over the codebase. Even if developers create utility classes for it (normally every developer creates one on his own), it’s a bad approach, because these concepts are not designed (“just read the language from the settings page!”), but the design “just happens”; there is no focus on it, therefore quality is not enforced and error handling is typically quite poor.

Another huge drawback of this approach: It’s very hard to change the content structure, because assumptions about the content structure are made at many levels, and they are often not directly visible. For example, the constants “0” and “2” in the above code snippet determine the path prefix, but you cannot search for such definitions (even if they are defined as constant values).

If a proper abstraction were provided, the above-mentioned code could look like this:

String language = "en";
Settings settings = resource.adaptTo(Settings.class);
if (settings != null) {
  language = settings.getLanguage();
}

This approach is totally agnostic of paths and the fact that settings are stored inside a page. It hides this behind a well-defined abstraction. And if you ever need to change the content structure you know where you need to change your code: Only in the AdapterFactory.

So whenever you see code which uses String operations on paths: Rethink it. Twice. And then come up with some abstraction layer and refactor it. It will have a huge impact on the maintainability of your code base.

JCR Observation in clustered AEM instances

Clustering AEM changed a bit with the introduction of Oak. And with the enforcement of the MVCC model in Oak I also advise revisiting some patterns you might have got used to, because some code which worked with no apparent problem in AEM 5.x might cause problems now.

One thing I would check are the JCR observation listeners. Using JCR observation is a common way to react to changes in the repository, and it has been a common pattern since CQ 5.0. So what’s the problem with that? The problem is that many JCR observation handlers are not written with clustering in mind.

Take the example that you need to react to changes in the repository and in turn modify something else. The usual approach is to have a service like this (omitting a lot of the boilerplate …)

public class MyListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  ObservationManager om = session.getWorkspace().getObservationManager();
  om.addEventListener (this,
   Event.NODE_ADDED,          // event types
   "/content/mysite",         // absPath
   true,                      // isDeep
   null,                      // uuids
   new String[]{"cq:Page"},   // node type names
   true);                     // noLocal
  ...
 }

 public void onEvent (EventIterator events) {
  // iterate through the events and change something in the repository.
 }

}

This works very well in any non-clustered environment, because there is only a single event handler performing these changes. In clustered environments the situation is different, because now such an event handler is active on each cluster node. And each one wants to perform the repository changes.
In that case you’ll see a lot of Oak exceptions (on all cluster nodes) which indicate that nodes have been modified externally (outside of the current session) and that a merge was not possible. This is because the changes happen (quasi-)parallel, but are not visible to the currently open sessions, thus causing these exceptions.

The only solution to this problem is to execute the EventListener only on a single node or to handle every event by exactly one event handler and not on all.

Handling every observation event by exactly one handler is the elegant and scalable solution. The idea is to handle on every cluster node only the changes which happen on this cluster node (“local events”). While the JCR API doesn’t have any notion of a cluster and the Observation API does not give any information on whether an event is local or not, the Jackrabbit implementation (which Oak is using here) supports this through the JackrabbitObservationManager. As you can see in the following snippet, only the registration of the listener changes, but not the handler itself.

public class MyScalableListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  JackrabbitEventFilter ef = new JackrabbitEventFilter()
   .setAbsPath("/content/mysite")
   .setNodeTypes(new String[]{"cq:Page"})
   .setEventTypes(Event.NODE_ADDED)
   .setIsDeep(true)
   .setNoExternal(true);
  JackrabbitObservationManager om = (JackrabbitObservationManager) session.getWorkspace().getObservationManager();
  om.addEventListener (this, ef);
  ...
 }

 public void onEvent (EventIterator events) {
  // iterate through the events and change something in the repository.
 }
}

Through the Jackrabbit API extension you can register your EventListener to handle only local changes and ignore any external ones, which are generated on other cluster nodes (using the setNoExternal(true) call). This is a scalable solution, because the events are handled at the location where they are generated, and no cluster node becomes a bottleneck because of this.

So whenever you write an ObservationHandler and especially when you use a cluster, you should review your code and make sure, that you avoid concurrent access to the same resource. Of course there are many ways to have concurrent access even without clustering, but when you actually use clustering, the JCR observation handlers are the easiest piece of code to check and fix.

Resource path vs URL and rewriting links

Today I want to discuss an aspect of an AEM application which is rarely considered during application development, but which normally gets very important right before a golive: the path element of a URL, and how it is constructed (either in the full version or in a shortened one).

Newcomers to the AEM world sometimes ask how the public URLs are determined and maintained; from their experience with older or other CMS systems pages have an ID and this ID has to be mapped somehow to a URL.
Within AEM this situation is different, because the author creates a page directly in the tree structure of a site. And the name of the page can be directly mapped to a URL. So if an author creates a page /content/mysite/en/news/happy-new-year-2016, this page can be reached via https://HOST/content/mysite/en/news/happy-new-year-2016.html (in the simplest form).

From a technical point of view, the resource path is mapped to the path element of a URL. In many cases this is a 1:1 mapping (that means that the full resource path is taken as the path of the URL). Often the “many” means “in development environments”, because in production environments these kinds of URLs are long and contain redundant information, which is something you should avoid. A URL also contains a domain part, and this domain part often carries information, so it isn’t needed in the path anymore.
So instead of “https://mysite.com/content/mysite/en/news.html” we rather prefer “https://mysite.com/en/news.html” and map only a subset of the resource path.

When mapping the resource path to a URL you must be careful, because the other way (the mapping of URL to resource path) has to work as well, and there must be exactly 1 mapping.
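
To make this round-trip requirement tangible, here is a toy illustration in plain Java. It is not how Sling implements its mappings (in a real setup you configure them and don’t code them by hand); the class PrefixMapper and its methods are invented for this example, and extensions like “.html” are left out to keep it short.

```java
// Toy model of a mapping which strips a site prefix from the resource path.
// Hypothetical class, for illustration only.
final class PrefixMapper {

    private final String prefix; // e.g. "/content/mysite"

    PrefixMapper(String prefix) {
        this.prefix = prefix;
    }

    // resource path -> URL path: strip the prefix for paths inside the site,
    // leave all other paths untouched (1:1).
    String toUrlPath(String resourcePath) {
        if (resourcePath.startsWith(prefix + "/")) {
            return resourcePath.substring(prefix.length());
        }
        return resourcePath;
    }

    // URL path -> resource path: naive rule, always prepend the prefix.
    String toResourcePath(String urlPath) {
        return prefix + urlPath;
    }

    // A mapping is only usable if mapping a resource path to a URL path
    // and back returns the original resource path.
    boolean roundTrips(String resourcePath) {
        return toResourcePath(toUrlPath(resourcePath)).equals(resourcePath);
    }
}
```

With the prefix “/content/mysite” the page /content/mysite/en/news survives the round trip, but a path outside of the site, such as /etc/designs/mysite/main.css, does not: it is delivered 1:1, and the naive rule for the way back then looks for it below /content/mysite. This is the kind of ambiguity you have to rule out when you define the mappings.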

These kinds of mappings (I often call the mapping “resource path to URL path” a forward mapping and the mapping “URL path to resource path” a reverse mapping) can be created using the /etc/map mechanism. In a web application you need to use both mappings:

  1. When the request is received, the URL path has to get mapped to a resource, so the Sling resource processing can start.
  2. When the rendered page contains links to other pages, the resource path of these pages has to be provided as URL path.

(1) is done automatically by Sling if the correct ruleset is provided. (2) is much more problematic, because all references to resources provided by AEM have to be rewritten. All references? Generally speaking yes; I will discuss this later on.

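This mapping can be done through the 2 API methods of the resource resolver: resolve() for the reverse mapping (URL path to resource) and map() for the forward mapping (resource path to URL path).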

You might wonder why you never use these 2 methods in your own code, even though I wrote above that all the links to other pages need to be rewritten. Basically you don’t have to do this, because the HTML created by the rendering pipeline (including all filters) is streamed through the Sling Output Rewriting Pipeline. This chain contains a rewriter rule, which scans through all the HTML and tries to apply a forward mapping to all links.

But it only runs on HTML output, and there are other elements of a site which contain references to content stored in AEM as well, for example Javascript or CSS files. References contained in these files are not rewritten, but delivered as they are stored in the repository. In many cases the setup is designed in a way that a 1:1 mapping still works; but that’s not always possible (or wanted).

So please take this as advice: Do not hardcode a path in CSS or Javascript files if there’s a chance that these paths need to be mapped.
(Rewriting other formats than HTML is not part of AEM itself; of course you can extend the defaults and provide a rewriting capability for Javascript and CSS as well, but that’s not an easy task.)

The question is whether you really have to rewrite all resource paths at all. In many cases it is ok just to have the URLs of the HTML pages looking nice (because these are the only URLs which are displayed prominently). All the other resources (e.g. assets, CSS and Javascript files) don’t need to get mapped at all; for them the default 1:1 mapping can be used. Then you’re fine, because you only have to do the mapping once in /etc/map and that’s it.

The Apache mod_rewrite module also offers very flexible ways to do reverse mapping, but it lacks a way to apply a forward mapping to the HTML pages (as the Sling Output Rewriter does). So mod_rewrite is a cool tool, but it is not sufficient to completely cover all aspects of resource mapping.

How can I avoid Oak write/merge conflicts?

Sandeep asked in a comment to the previous posting:

Even if your sessions are short, and you have made a call to session.refresh(true), it is possible that some one made a change before you did a session.save(), right? So, what is the best practice in dealing with such a scenario?
Keep refreshing (session.refresh(true)) in a loop, until your session.save() is successful or until you hit an assumed maximum number of attempts limit?
Or is there any other recommended best practice?

That’s a good question, but also a question with no satisfying answer. Whenever you want to modify nodes in the repository, there’s a chance that the same nodes are changed in parallel, even if you change only a single node or property. In reality, this rarely happens. Most features in AEM are built in a way that each step (or each workflow instance, each Sling job, each replication event, etc) has its own nodes to operate upon. So concurrency does not have to lead to a situation in which multiple concurrent operations compete for writes on a single node.
So from a coding perspective it should be possible to avoid such situations. Not only because of this kind of problem, but also because of performance and debugging.

Something you cannot deal with in this way are author changes. If 2 authors decide to change a page at the same time, it’s likely that they screw up the page. You can hardly avoid that just by using code. But if you cannot guarantee from a work organization point of view that no 2 persons work on the same page at the same time, teach your authors to use the “lock” feature. It basically prevents other authors from making changes temporarily. But according to the Oak documentation it isn’t suited to be used for short-living locks (in a database sense), but rather for longer-living locks (an author locks a page to prevent other authors from editing it).

So, to come to a conclusion on Sandeep’s question: It depends. If you designed your application carefully, you should rarely run into situations in which multiple threads compete for a node. But whenever it occurs it should be considered a bug, analyzed and then fixed.
But there can be other cases, where this approach could make sense. In any case I would retry a few times (e.g. 10) and then break the operation with a meaningful log message. But I don’t think that it’s good to retry indefinitely.
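
Such a bounded retry could look like the following sketch in plain Java. All names (BoundedRetry, ConflictException) are made up for this example; in real code the operation would modify the nodes and call session.save(), the refresh callback would refresh the session, and the catch clause would handle whatever exception your repository throws on a failed merge.

```java
import java.util.function.Supplier;

// Hypothetical helper: retry an operation a limited number of times
// and give up with a meaningful message afterwards.
final class BoundedRetry {

    // Stand-in for the exception the repository throws on a write conflict.
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    static <T> T run(int maxAttempts, Runnable refresh, Supplier<T> operation) {
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (ConflictException e) {
                last = e;
                refresh.run(); // pick up the concurrent changes before the next attempt
            }
        }
        // Never retry indefinitely: fail loudly so the conflict can be analyzed.
        throw new IllegalStateException(
            "Giving up after " + maxAttempts + " attempts because of write conflicts", last);
    }
}
```

A call like BoundedRetry.run(10, refreshCallback, saveOperation) then either returns the result of the first successful attempt or fails with an exception which should end up in the log. Only conflicts are retried in this sketch; any other exception is propagated immediately.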