
Validating AEM content-packages

A typical task when you run AEM as a platform is deployment. As the platform team you own the platform, and you own the admin passwords. It’s your job to deploy the packages that the various teams deliver to you. And it’s also your job to keep the platform reliable and stable.

With every deployment you have the chance to break something, and not only the part of the platform which belongs to the team whose code you deploy. If their (incorrect) code breaks their own parts, that’s not your problem. But you may also break the system for other tenants, which are not involved in the deployment at all.

This is one of your most important tasks as platform owner: a single tenant must not break other tenants. Never! The problem is just that this is nearly impossible to guarantee. You typically rely on trust towards the development teams, and on them earning that trust.

To help you a little bit with this, I created a simple Maven plugin which can validate content-packages against a ruleset. In this ruleset you can define, for example, that a content-package delivered by tenant A may only contain content paths which are valid for tenant A, and that the validation should fail if the content-package would override clientlibraries of tenant B, introduce new overlays in /apps/cq, or introduce a new OSGi configuration with a non-project PID. The rules can target anything which can be part of a content-package.
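
Just to illustrate the idea behind such a ruleset (this is not the plugin’s actual rule syntax, which is described in the README, but only a conceptual sketch with made-up paths): in the end each file entry of a content-package is matched against a set of allowed path prefixes, and the build fails on the first entry which is not covered.

import java.util.List;

// Conceptual sketch only; the real plugin reads its rules from the Maven
// configuration and walks the file system structure of the content-package.
public class TenantPathRule {

    // Hypothetical whitelist for tenant A
    private static final List<String> ALLOWED_PREFIXES = List.of(
            "/apps/tenant-a/",
            "/content/tenant-a/",
            "/content/dam/tenant-a/");

    /** Returns true if the given package entry is allowed for tenant A. */
    public static boolean isAllowed(String path) {
        return ALLOWED_PREFIXES.stream().anyMatch(path::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("/apps/tenant-a/components/teaser")); // true
        System.out.println(isAllowed("/apps/tenant-b/clientlibs/base"));   // false: fail the build
        System.out.println(isAllowed("/apps/cq/overlay/something"));       // false: fail the build
    }
}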

Check out the GitHub repo and the README for its usage.

As already noted above, it can help you as a platform owner to ensure a certain quality of the packages you are supposed to install. On the other hand, it can help you as a project team to establish a set of rules which you want to follow. For example, you can verify a “we don’t use overlays” policy with this plugin as part of the build.

Of course the plugin is not perfect, and you can still easily bypass the checks, because it does not parse the .content.xml files inside the package, but just checks the file system structure. And of course it cannot check bundles and the content which comes with them. But we should assume that no team wants to break the complete system when deployment packages are being created (there are much easier ways to do that); we just want to avoid the usual errors which happen when working under stress. If we catch a few of them upfront for the cost of configuring a ruleset once, it’s worth the effort 🙂

Detecting JCR session leaks

A problem I encounter every now and then are leaking JCR sessions; that means JCR sessions which are opened but never closed, just abandoned. Like files, JCR sessions need to be closed, otherwise their memory is not freed and they cannot be garbage collected by the JVM. Depending on the number of sessions you leave in that state, this can lead to serious memory problems, ultimately leading to a crash of the JVM because of an OutOfMemory situation.

(And just to be on the safe side: in AEM, out of the box, all ResourceResolvers use a JCR session internally; that means whatever I just said about JCR sessions applies in the same way to Sling ResourceResolvers.)
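
As a reminder how properly closed sessions look like, here is a minimal sketch (the service user mapping “my-service” and the injected references are assumptions, not taken from a concrete project): whoever opens a session or a ResourceResolver closes it in a finally block, or uses try-with-resources for ResourceResolvers, since they implement Closeable.

import java.util.Collections;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.apache.sling.jcr.api.SlingRepository;

public class SessionHandlingSketch {

    private SlingRepository repository;               // would be an OSGi @Reference in a real component
    private ResourceResolverFactory resolverFactory;  // would be an OSGi @Reference in a real component

    public void jcrStyle() throws RepositoryException {
        Session session = null;
        try {
            session = repository.loginService("my-service", null);
            // ... work with the session ...
            session.save();
        } finally {
            if (session != null && session.isLive()) {
                session.logout(); // without this the session (and its MBean) leaks
            }
        }
    }

    public void slingStyle() throws LoginException {
        // ResourceResolver implements Closeable, so try-with-resources closes it
        // (and the JCR session it wraps) even if an exception is thrown.
        try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(
                Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, (Object) "my-service"))) {
            // ... work with the resolver ...
        }
    }
}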

I have dealt with this topic a few times already (and always recommended to close the JCR sessions), but today I want to focus on how you can easily find out if you are affected by this problem.

We use the fact that for every open session an MBean is registered. Whenever you see a statement like this in your log:

14.08.2018 00:00:05.107 *INFO* [oak-repository-executor-1] com.adobe.granite.repository Service [80622, [org.apache.jackrabbit.oak.api.jmx.SessionMBean]] ServiceEvent REGISTERED

That says that an MBean service has been registered for a JCR session; thus a JCR session has been opened. And of course there’s a corresponding message for unregistering:

14.08.2018 12:02:54.379 *INFO* [Apache Sling Resource Resolver Finalizer Thread] com.adobe.granite.repository Service [239851, [org.apache.jackrabbit.oak.api.jmx.SessionMBean]] ServiceEvent UNREGISTERING

So it’s very easy to find out whether you have a memory leak because of leaking JCR sessions: the number of log statements for the registration of these MBeans must match the number of log statements for their unregistration.

In many cases you probably don’t have exact matches. But that’s not a big problem if you consider:

  • On AEM startup a lot of sessions are opened, and JCR observation listeners are registered to them. That means that in a logfile which covers AEM starts and stops (where the number of starts does not match the number of stops) it’s very likely that these numbers do not match. Not a problem.
  • The registration (and also the unregistration) of these MBeans often happens in batches; if this happens during logfile rotation, you might see an imbalance, too. Again, not a problem per se.

It becomes a problem if the number of sessions opened is consistently bigger than the number of sessions closed over the course of a few days.

$ grep 'org.apache.jackrabbit.oak.api.jmx.SessionMBean' error.log | grep "ServiceEvent REGISTERED" | wc -l
212123
$ grep 'org.apache.jackrabbit.oak.api.jmx.SessionMBean' error.log | grep "ServiceEvent UNREGISTERING" | wc -l
1610
$

Here I just have the log data of a single day, and it’s very obvious that there is a problem, as around 210,000 sessions are opened but never closed. On a single day!

To estimate the effect of this, we need to consider that for each of these leaked sessions the following objects are retained:

  • A JCR session (plus the objects it references; depending on the activities happening in this session this might also include pending changes, which are never going to be persisted)
  • An MBean (referencing this session)

So if we assume that 1 KB of memory is associated with every leaking session (and that’s probably a very optimistic assumption), the system above would lose around 210 MB of heap memory every day. Such a system probably requires a restart every few days.

How can we find out what is causing this memory leak? Here it helps that Oak stores the stack trace of the session opening as part of the session object. Since around Oak 1.4 this is only done if the number of open sessions exceeds 1000; you can tune this threshold with the system property “oak.sessionStats.initStackTraceThreshold” and set it to an appropriate value. This is a great help to find out where the session is opened.

And then go to /system/console/jmx, check for the “SessionStatistics” MBeans (typically quite at the bottom of the list) and select the most recent ones (they have the opening date already in their name):

Screenshot: session information in the MBean view

In the “initStackTrace” attribute you can then find the trace where this session has been opened:

Screenshot: stack trace of an open JCR session
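
If you prefer not to click through the console, the same information can also be collected programmatically, for example from a script console running inside the AEM JVM. Note that the ObjectName pattern and the “InitStackTrace” attribute name below are assumptions based on how Oak registers these MBeans; verify them against your instance before relying on the output. The sketch groups the open sessions by their init stack trace, which points you directly to the most frequent offenders:

import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public final class SessionLeakReport {

    // Call this inside the AEM JVM (script console, maintenance servlet, ...).
    public static void printMostFrequentInitStackTraces() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Assumed pattern: Oak registers one SessionStatistics MBean per open session.
        ObjectName pattern = new ObjectName("org.apache.jackrabbit.oak:type=SessionStatistics,*");
        Map<String, Integer> stackCounts = new HashMap<>();
        for (ObjectName name : server.queryNames(pattern, null)) {
            String stack = String.valueOf(server.getAttribute(name, "InitStackTrace"));
            stackCounts.merge(stack, 1, Integer::sum);
        }
        stackCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(5)
                .forEach(e -> System.out.println(
                        e.getValue() + " open sessions with this init stack trace:\n" + e.getKey()));
    }
}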

With the information at hand where the session has been opened, it should be easy for you to find the right spot to close the session.
If you spot a place where a session is opened in AEM product code but never closed, please raise that with Adobe support. But be aware that during system startup sessions are opened which stay open while the system is running. That’s not a problem at all, so please do not report them!

It’s only a problem if you have at least a few hundred sessions open with the very same stack trace; that’s a good indication of such a “leaking session” problem.

There is a good follow-up reading on the AEM HelpX pages with some details on how you can fix such leaks.

Referencing runmodes in Java

There was a question at this year’s AdaptTo why there is no Java annotation to limit the scope of a component (imagine a servlet) to a specific runmode. This would allow you to specify in Java code that a servlet is only supposed to run on author.

Technically it would be easy to implement such an annotation. But it’s not done, for a reason: runmodes have been developed as a deployment vehicle to ship configuration. That means your deployment artifact can contain multiple configurations for the same component, and the decision which one to use is based on the runmode.
Runmodes are not meant to be used as a differentiator so that code can operate differently based on the runmode. I would go so far as to say that the use of slingSettings.getRunModes() should be considered bad practice in AEM project code.

But of course the question remains how one would implement the requirement that something must only be active on authoring (or on any other environment which can be expressed by runmodes). For that I would like to reference an earlier posting of mine: you still leverage runmodes, but this time via the indirection of an OSGi configuration. This avoids hardcoding the runmode information in the Java code.
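
A minimal sketch of this indirection with OSGi Declarative Services annotations (component name and property are made up for illustration): the code only looks at its own “enabled” property, and the author/publish difference is expressed purely by the runmode-specific OSGi configuration you ship.

import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.metatype.annotations.AttributeDefinition;
import org.osgi.service.metatype.annotations.Designate;
import org.osgi.service.metatype.annotations.ObjectClassDefinition;

@Component(service = MaintenanceTask.class)
@Designate(ocd = MaintenanceTask.Config.class)
public class MaintenanceTask {

    @ObjectClassDefinition(name = "Example Maintenance Task")
    public @interface Config {
        @AttributeDefinition(name = "Enabled")
        boolean enabled() default false;
    }

    private boolean enabled;

    @Activate
    protected void activate(Config config) {
        // The decision comes from configuration only, e.g. (made-up paths)
        //   /apps/myproject/config.author/com.example.MaintenanceTask.cfg.json  -> { "enabled": true }
        //   /apps/myproject/config.publish/com.example.MaintenanceTask.cfg.json -> { "enabled": false }
        // so the code itself never asks for the current runmode.
        this.enabled = config.enabled();
    }

    public void run() {
        if (!enabled) {
            return; // effectively inactive on instances where the configuration says so
        }
        // ... the actual work ...
    }
}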

Content architecture: dealing with relations

In the AEM forums I recently came across a question about slow queries. After some back and forth I understood that the poster wanted to execute thousands of such queries to render a page: when rendering a product page he wanted to reference the assets associated with it.

For me the approach used by the poster was straightforward, based on the assumption that the assets can reside anywhere within the repository. But that’s rarely the case. The JCR repository is not a relational database, where all you have are queries; with JCR you can also traverse the structure. It’s a question of your content architecture and how you map it to AEM.

That means that for requirements like the one described you can easily design your application in a way that all assets belonging to a product are stored below the product itself.

Or for each product page there is a matching folder in the DAM where all its assets reside. So instead of a JCR query you just do a lookup of a node at a fixed location (in the first case the subnode “assets” below the product), or you compute the path to the assets (e.g. /content/dam/products/product_A/assets). That single lookup will always be more performant than a query, plus it’s also easier for an author to spot and work with all assets belonging to a product.
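
A minimal sketch with the Sling API (paths and names are just examples, not taken from the forum post): the path to the assets folder is derived from the product name, and the children are simply iterated instead of being queried.

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class ProductAssetsSketch {

    /**
     * Looks up the assets folder of a product at a computed path,
     * e.g. /content/dam/products/product_A/assets, instead of running a JCR query.
     */
    public static Resource getAssetsFolder(ResourceResolver resolver, String productName) {
        return resolver.getResource("/content/dam/products/" + productName + "/assets");
    }
}

// Usage, e.g. in a Sling Model:
//   Resource assetsFolder = ProductAssetsSketch.getAssetsFolder(resolver, "product_A");
//   if (assetsFolder != null) {
//       for (Resource asset : assetsFolder.getChildren()) {
//           // render a reference to the asset
//       }
//   }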

Of course this is a very simplified case. Typically requirements are more complex, and asset reuse is often required; then this approach no longer works that easily.
There is no real recipe for these cases, but there are ways to deal with them.

For creating such relations between content we often use tags. Content items carrying the same tag are related and can be added automatically to the list of related content or assets. Using tags as a level of indirection is fine, and in the context of the forum post also quite performant (albeit the resolution itself is powered by a single query).
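
A sketch of such a tag-based lookup with AEM’s TagManager API (the tag ID and the search path are made up): the find() call resolves all resources below the given path which carry the tag, backed by that single query mentioned above.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import com.day.cq.tagging.TagManager;

public class RelatedContentSketch {

    /** Returns all resources below /content/dam which carry the given tag. */
    public static List<Resource> findByTag(ResourceResolver resolver, String tagId) {
        TagManager tagManager = resolver.adaptTo(TagManager.class);
        List<Resource> result = new ArrayList<>();
        Iterator<Resource> hits = tagManager.find("/content/dam", new String[] { tagId });
        while (hits != null && hits.hasNext()) {
            result.add(hits.next());
        }
        return result;
    }
}

// Usage: List<Resource> related = RelatedContentSketch.findByTag(resolver, "myproject:products/series-a");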

Another approach to modelling the content structure is to look at the workflows the authors are supposed to use, because they also need to understand the relationships between content items, which normally leads to something intuitive. Looking at these details might also give you a hint how the content can be modelled; maybe just storing the paths of the referenced assets as part of the product is already enough.

So, as already said in an earlier post, there are many ways to come up with a decent content architecture, but rarely recipes. In most cases it pays off to invest time into it and to consider the effects it has on the authoring workflow, performance and other operational aspects.

HTL – a wrong solution to the problem?

(in response to Dan’s great posting “A Retrospective on HTL: The Wrong Solution for the Problem”)

Dan writes that with JSP there is already a powerful and useful templating language, so there is hardly a good reason for an experienced developer to switch to another one (well, besides the default XSS handling of HTL).

Well, I think I can agree on that. An experienced Java web developer knows the limits of JSP and JSP scriptlets and is capable of writing maintainable code. And to be fair, the code created by these developers is rarely a problem.

And Dan continues:

I am more productive writing JSP code and I believe most developers would be as well, as long as they avoid Scriptlet and leverage the Sling JSP Taglib and Sling Models.

And here I see the problem. Everything works quite well as long as you can keep up the discipline. In my experience this works

  • As long as you have experienced developers who make the right decisions and
  • As long as you have time to fix things in the right way.

The problems begin when the first fix is done in the JSP instead of the underlying model, or when logic is created in the JSP instead of in a dedicated model. Having such examples in your codebase can be seen as the beginning of what the broken window theory describes: they act as examples of how things can be done (and gotten away with), unless you start fixing them right away.

It requires a good amount of experience as a developer, discipline and assertiveness towards your project lead to avoid the quick fix and to do it right instead, as that typically takes more time. If you live such a culture, that’s great! Congratulations!

If you don’t have the chance to work in such a team — you might need to work with less capable developers, or you have a high fluctuation in your team(s) — you cannot trust each individual to make the right decision in 99 percent of all cases. Instead you need a set of rules which do not require too much interpretation and discussion to apply correctly. Rules such as:

  • Use HTL (because then you cannot implement logic in the template)
  • Always build a model class (even if you could get away without one)

It might not be the most efficient way to develop code, but in the end you can be sure that certain types of errors do not occur (such as missing XSS protection, large JSP scriptlets, et cetera). In many cases this outweighs the drawbacks of using HTL by far.

In the end using HTL is just a best practice. And as always you can deliberately violate best practices if you know exactly what you are doing and why following the best practices prevents you from reaching your goal.

So my conclusion is the same as Dan’s:

Ultimately, your choice in templating language really boils down to what your team is most comfortable with.

If my team does not have a proven track record of delivering good JSPs (the teams I worked with in the last years did not), or if I don’t know the team very well, I will definitely recommend HTL, despite all its drawbacks. Because then I know what I get.

Sling Context-Aware configuration (part 7): A conclusion

This is part 7 of my small series on Sling Context-Aware Configuration, and probably its final posting. Time for a conclusion.

CAC addresses a need I often see in projects: having configuration which is neither OSGi config nor “real” content, but something in between. Something a super-author is supposed to work with, which can be changed at runtime without the need for admin privileges. And it solves these requirements quite well.

The design is elegant and extensible, and the API is easy to use (of course the documentation could be improved, but that’s probably something every open-source project could benefit from). And the fact that there is the wcm.io toolbox, which adds even more functionality, is absolutely great; thanks Stefan and team!

The only problem I see is AEM itself. The fact that you need to add third-party software to write the configuration in a way that it can be integrated into the authoring workflow is something one can question; I would expect AEM to offer this capability out of the box. And while we are at it, I would also hope that the use of CAC within the product itself will increase over time.

But besides that, it’s an overall cool thing. Finally a good way to get rid of these homegrown administration frameworks. Enjoy Sling Context-Aware Configuration!

Sling Context-Aware configuration (part 6): replicate CA-Config

In the previous blog posts we focused solely on the authoring case, where an author edits content and configuration. But I haven’t yet discussed how this CA config gets to the publish instances. As such config can be changed on the fly by authors, we cannot rely on deployment processes; it has to fit into the regular processes triggered and monitored by regular users.

And to make this happen, we will use an out-of-the-box functionality of AEM: replication. From a UI perspective replication is limited to assets and pages, and we don’t want to change that for config.
That means that we have to wrap all config into a cq:Page to make it work. Unfortunately Sling itself does not know anything about AEM pages and their structure (the page node and the jcr:content node), thus writing the configuration into an AEM page is not possible out of the box (since AEM 6.3 reading from pages is supported).

Fortunately the wcm.io toolbox can help us here as well. With the optional “PagePersistenceStrategy” you can read and write the configuration in the jcr:content node of a page. The code for it is part of the wcm.io caconfig extensions package and should be deployed into AEM, and then an OSGi configuration for io.wcm.caconfig.extensions.persistence.impl.PagePersistenceStrategy needs to be set (enabled=true).

(I updated my GitHub project on CA config and added all relevant pieces; you just need to enable the configuration for this service in the ui.apps module. I left it turned off so it does not break the functionality I outlined in earlier posts.)

But how does this help us here? The problem are the references to the configuration: when you activate a page, by default AEM also wants to activate all referenced pages and assets of that page which are not yet activated. A reference to a plain node, however, is not considered and never listed; wrapping the configuration into pages makes it visible to this mechanism.
But this only works for directly referenced pages. If you change a page and its parent holds the sling:configRef property, the standard ReferenceSearch functionality of AEM does not help. For this there is a custom reference provider by wcm.io, which is turned on by default.

Now you have a chance to edit your configuration on author and activate it afterwards. It’s still a bit clumsy, but at least manageable as part of an authoring workflow.

And then I discovered a small feature which has recently been added to the wcm.io Configuration Editor (version 1.3.0): the editor now has a “Publish this page” button, which claims to activate the configuration and all referenced CA configurations. Unfortunately it’s not working properly yet, but when it’s fixed, it can help a lot.