AEM anti pattern: Long running sessions

AEM 6.x comes with Apache Oak, which is built on the MVCC principle. MVCC (multi-version concurrency control) gives you a view on a certain state of the repository. This state does not change; it can be considered immutable. If you want to change the repository, the change is performed against this state and then applied (merged) to the HEAD state (the most current state of the repository). This merge is normally not a problem if the state of the session doesn’t differ too much from the HEAD state; in case the merge fails, you get an OakMerge exception.
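To make the snapshot-and-merge idea more tangible, here is a minimal toy model in plain Java. It is not how Oak is implemented (Oak's merge logic is far more sophisticated); the classes ToyRepository and ToySession and the path used are made up for illustration. A session works on a private copy of the state it was created from, and a save fails as soon as HEAD has changed one of the paths the session wants to write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy model only: illustrates the MVCC idea, not Oak's real data structures.
class ToyRepository {
    private final Map<String, String> head = new HashMap<>();

    // Hand out a private copy of the current HEAD state.
    synchronized Map<String, String> snapshot() {
        return new HashMap<>(head);
    }

    // Merge changes into HEAD; fail if HEAD differs from the session's base
    // state for any path the session wants to change.
    synchronized void merge(Map<String, String> base, Map<String, String> changes) {
        for (String path : changes.keySet()) {
            if (!Objects.equals(base.get(path), head.get(path))) {
                throw new IllegalStateException("Unresolved conflicts in " + path);
            }
        }
        head.putAll(changes);
    }
}

class ToySession {
    private final ToyRepository repo;
    private Map<String, String> base;                                // state seen by this session
    private final Map<String, String> changes = new HashMap<>();     // pending, unsaved changes

    ToySession(ToyRepository repo) {
        this.repo = repo;
        this.base = repo.snapshot();
    }

    void setProperty(String path, String value) {
        changes.put(path, value);
    }

    void save() {
        repo.merge(base, changes);   // throws on conflict; pending changes stay in place
        base = repo.snapshot();
        changes.clear();
    }
}

class MvccDemo {
    // Returns true if the save of the older session fails with a conflict.
    static boolean staleSaveConflicts() {
        ToyRepository repo = new ToyRepository();
        ToySession longRunning = new ToySession(repo);   // opened first, never refreshed
        ToySession other = new ToySession(repo);

        other.setProperty("/content/page/title", "from other");
        other.save();                                    // HEAD moves on

        longRunning.setProperty("/content/page/title", "from long-running");
        try {
            longRunning.save();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("conflict: " + staleSaveConflicts());
    }
}
```

The longer a session lives without being refreshed, the more likely it is that its base state and HEAD differ for the paths it writes to.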

Note that this is a change compared to Jackrabbit 2.x and CRX 2.x, where the state of a session was always up to date and where these merge exceptions never happened. This also means that you might need to change your code to make it work well with Oak!

If you have long-running sessions, the probability of such OakMerge exceptions gets higher and higher. This is due to other changes happening in the repository, which can also affect the areas where your session wants to perform its changes. This is a problem especially in cases where you run a service which opens a session in the activate() method, closes it in deactivate() and uses it to save data to the repository as well. These are rare cases (because they have been discouraged for years), but they still exist.

The problem is that if a save() operation fails due to such an OakMerge exception, the temporary (transient) space of that session is polluted. The temporary space of a session is heap memory, where all the changes which are about to get saved are stored. A successful save() cleans that space afterwards, but if an exception happens, this space is not cleaned. And if a save() fails because of such an OakMerge exception, any subsequent save() of that session will fail as well.

Such an exception could look like this (relevant parts only):

Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /content/geometrixx/en/services/jcr:content
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.newRepositoryException(
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(

There are 2 ways to mitigate this problem:

  • Avoid long-running sessions and replace them with a number of short-lived sessions. This is the way to go and in most cases the easiest solution to implement. It also avoids the problems that come with shared sessions.
  • Add code to call session.refresh(true) before you make your changes. This refreshes the session state to the HEAD state, so exceptions are less likely. If you run into a RepositoryException, you should explicitly clean up your transient space using session.refresh(false); you’ll lose your changes then, but the next save() will not fail because of leftovers. This is the path you should choose when you cannot create new sessions.
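The second option can be sketched like this. To keep the example compilable without the JCR API on the classpath, SessionLike is a hypothetical stand-in which only mirrors the two javax.jcr.Session methods used here (refresh(boolean) and save()); Change and SafeSave are made-up names as well. In real code you would work with javax.jcr.Session directly and catch RepositoryException.

```java
// Hypothetical stand-in for the two javax.jcr.Session methods used in this sketch.
interface SessionLike {
    void refresh(boolean keepChanges) throws Exception;
    void save() throws Exception;
}

// Hypothetical callback which performs the actual modifications on the session.
interface Change {
    void apply() throws Exception;
}

class SafeSave {
    // Returns true if the change was saved, false if it had to be discarded.
    static boolean applyAndSave(SessionLike session, Change change) {
        try {
            session.refresh(true);    // move the session to the HEAD state first
            change.apply();           // then perform the modifications
            session.save();
            return true;
        } catch (Exception e) {
            try {
                session.refresh(false);   // drop the pending changes so the next save() starts clean
            } catch (Exception ignored) {
                // nothing more we can do in this sketch
            }
            return false;
        }
    }
}
```

The caller decides what to do when false is returned, for example retrying the change or logging it. If you can create new sessions, the first option (a short-lived session per unit of work) remains the simpler choice.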

Changed Sling bundles in AEM 6.0 Servicepack 3

Servicepack 3 for AEM 6.0 is now available (releasenotes).

Here’s the complete list of sling bundles in stock AEM 6.0 and the various levels of servicepacks. Bundles which are not available with a specific version are listed as “-“, version numbers marked in red appeared first in this servicepack. Where possible I added the links to the changes in servicepack 3.

The most notable change in SP3 from a Sling perspective is the switch to the fsclassloader (see 6D’s blogpost for it) for all scripting languages. So the compiled JSPs no longer reside inside the repository (/var/classes), but are now placed in the filesystem.

Symbolic name aeM 6.0 aem 6.0 SP1 aem 6.0 SP2 aem 6.0 SP3 2.1.0 2.1.0 2.1.0 2.1.0 2.7.0 2.7.0 2.8.0 2.8.0 0.9.0.R988585 0.9.0.R988585 0.9.0.R988585 0.9.0.R988585 1.1.7.R1584705 1.1.7.R1584705 1.1.7.R1584705 1.1.7.R1584705 0.0.1.R1582230 0.0.1.R1582230 0.0.1.R1582230 0.0.1.R1681728 2.2.0 2.2.0 2.2.0 2.2.0 1.3.2 1.3.2 1.3.2 1.3.2 2.1.0 2.2.0 2.2.0 2.2.0 1.0.2 1.0.0 1.0.0 1.0.0 1.0.0 2.0.6 2.0.6 2.0.6 2.0.6 4.0.0 4.0.0 4.0.0 4.0.0 1.0.2 1.0.2 1.0.2 1.0.2 2.1.4 2.1.4 2.1.4 2.1.4 2.2.0 2.2.0 2.2.0 2.2.0 2.4.2 2.4.2 2.4.2 2.4.8 (changelog) 3.2.0 3.2.0 3.2.0 3.2.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.8 1.0.8 1.0.8 1.1.6 (changelog) 1.0.0 1.0.0 1.0.0 1.0.0 2.3.3.R1588174 2.3.3.R1588174 2.3.10 2.3.10 3.3.10 3.3.10 3.5.0 3.7.4 (changelog) 1.0.2 0.2.2 0.2.2 0.2.2 0.2.2 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 2.2.8 2.2.8 2.2.8 2.2.8 1.0.0 1.0.0 1.0.0 1.0.0 3.5.0 3.5.0 3.5.4 3.6.4 (changelog) 1.0.12 1.0.12 1.0.12 1.1.2 (changelog) 1.0.2 1.0.2 1.0.4 1.1.0 (changelog) 3.1.6 3.1.6 3.1.8 3.1.8 0.1.0 0.1.0 0.1.0 0.1.0 2.2.0 2.2.0 2.2.0 2.2.0 2.2.2 2.2.2 2.2.2 2.2.2 3.2.0 3.2.0 3.2.0 3.2.0.B001-EMPTY 2.1.0 2.1.0 2.1.0 2.1.0 2.1.6 2.1.6 2.1.6 2.1.6 1.2.0 1.2.0 1.2.0 1.2.0 2.0.0 2.0.0 2.0.0 2.0.0 1.0.0 1.0.0 1.0.0 1.0.0 2.3.7.R1591843 2.3.7.R1591843 2.3.8 2.4.4.B001 (changelog) 0.0.1.R1562502 0.0.1.R1562502 0.0.1.R1562502 0.0.1.R1562502 2.2.2 2.2.2 2.2.2 2.2.2 1.0.2 1.0.2 1.0.2 1.0.2 1.2.0 1.2.0 1.2.0 1.2.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.2 1.0.4 1.0.4 1.0.4 1.0.2 1.0.2 1.0.2 1.0.2 0.0.1.R1579485 0.0.1.R1579485 0.0.1.R1579485 0.0.1.R1579485 1.0.0 1.0.0 1.0.0 1.0.0 1.1.2 1.1.2 1.1.2 1.1.2 1.1.0 1.1.1.R1618115 1.1.6 1.1.14.B008 (changelog) 1.0.4 1.0.4 1.0.4 1.0.4 2.1.6 2.1.6 2.1.6 2.1.6 2.0.26 2.0.26 2.0.26 2.0.26 2.0.6 2.0.11.R1607999 2.0.12 2.0.12 2.0.13.R1566989 2.0.14 2.0.14 2.0.14 2.0.28 2.1.4 2.1.4 2.1.6 (changelog) 2.2.0 2.2.0 2.2.0 2.2.0 2.0.6 2.0.6 2.0.6 2.0.6 1.0.6 1.0.6 1.0.6 1.0.10 
(changelog) 1.0.0 1.0.0 1.0.0 1.0.4 (changelog) 1.0.0.Revision1200172 1.0.0.Revision1200172 1.0.0.Revision1200172 1.0.0.Revision1200172 2.1.8 2.1.8 2.1.8 2.1.8 2.3.4 2.3.5.R1592719 2.3.5.R1592719 2.3.5.R1592719-B004 2.3.2 2.3.2 2.3.6 2.3.6 1.3.0 1.3.0 1.3.0 1.3.0 0.0.1.Rev1526908 0.0.1.Rev1526908 0.0.1.Rev1526908 0.0.1.Rev1526908 0.0.1.Rev1387008 0.0.1.Rev1387008 0.0.1.Rev1387008 0.0.1.Rev1387008 1.0.0 1.0.0 1.0.0 1.0.0

The problems of multi-tenancy: tenant separation and „friendly tenants”

In the last articles (1,2) I covered some aspects of multi-tenancy, which are very likely to occur in AEM projects (but are not restricted to such projects). I stressed that there are a lot of aspects which have the potential to cause trouble on a non-technical level. But you cannot draw a clear line between the business/political aspects and the technical aspects, because they often tend to fuel each other. Implementing multi-tenancy is a political decision which implies design, implementation and operational decisions, which do not come for free; these in turn heat up any business discussion about the costs of the platform. And then the call goes back to the architect not to implement the full stack, but only a reduced one, which can cause trouble again on the business side … there are a lot of these stories, and they only prove that you can hardly make a decision in one domain without impacting the other.

But let’s focus now on the technical level and how it is influenced by multi-tenancy. In any multi-tenancy system the full and clean separation of the tenants is the ultimate goal. That means: no shared resources beyond the ones which are supposed to be shared intentionally. At the very least, the usage of the shared resources must be restricted in a way that one tenant cannot negatively influence the other tenants, or that the influence of any single tenant on the others is marginal and always manageable. On the other hand it should be cost-effective, which means that a multi-tenancy system for N clients must be cheaper than N non-multi-tenant systems (a single system for each tenant).

(If you reach this point it might make sense to evaluate whether the additional cost of making a system capable of operating multiple tenants outweighs the cost and complexity of managing more systems. If that’s the case, stop here, create a single-tenant application and deploy it to multiple systems.)

The simplest approach to multi-tenancy is to host all tenants (or as many as possible) on a single system. As all these tenants now live within the boundaries of a single instance (a single JVM, a single hardware/virtual machine), they share all the hardware resources (CPU, memory, I/O), but also the software resources (threads, queues, caches, „the application“). This sharing means first and foremost that the maximum performance of each tenant is limited, under the assumption that other tenants need resources at the same time too.
This scenario (let’s call it „friendly tenants“) is often encountered in enterprises, where multiple brands, divisions or countries are hosted on a single platform. But it has some implications:

  1. All tenants share the same application.
  2. Downtime for platform upgrades/maintenance/bugfixes affects all tenants.
  3. Platform failures affect all tenants.

These limitations can be quite heavy. While limitations 2 and 3 are accepted in most cases (given that the platform is stable and performant otherwise), the limitation of the development scope is often considered a problem, because it enforces that all changes a tenant demands go into the platform. Thus all requirements of all tenants are prioritized from a platform perspective („which features bring the most benefit for all tenants?“), so the priorities of a single tenant don’t have that much weight.
Of course you can allow custom development for individual tenants (maybe even by multiple development parties), but then the application must be designed and implemented carefully to avoid „friendly fire“ (changes for one tenant affecting other tenants as well).

This „friendly tenant“ scenario is likely to have the lowest costs, as the usage of resources is low compared to the number of tenants and the individual requirements of tenants are often considered lower priority compared to the requirements shared by a set of tenants. With AEM you can implement such a scenario quite well using ACLs. The MSM gives you a good tool when the tenants also share content.

Dispatcher and shared content

In the September session of the Ask the expert series (passcode: “Dispatch”) I talked about problems arising out of the requirement to deal with multiple sites, each site having its own domain, where a Sling mapping is used to map the long repository paths to shorter URLs (like mapping /content/geometrix/en/services.html to a shorter URL on the site’s domain). I already tried to deal with this question in the Q&A part of the session, but I will write about it here in more depth.

In the session on AEM dispatcher setups there was a question about how to deal with shared content. If you do a straightforward configuration of the dispatcher and map a shared content path (be it assets or pages) into the site structure of a site, the content is cached at this location in the dispatcher cache, but the invalidation happens only once, at the „original“ path. So the content within the mapped paths in the site structure is not invalidated at all.
This is a problem, but you can see this problem from more than one angle.

The first question is whether you really need to share this content at all. I am not an SEO expert, but from what I heard, having duplicate content on multiple domains gives you a negative score on your page rank. Also, from my point of view, at some point the necessity arises to customize this shared content per tenant, which often leads to copying a shared page into the site and customizing it there, essentially not using the shared content anymore. If there’s the risk of having this problem, you should think of using the MSM to avoid this „copy-and-adapt“ workflow and make it manageable. In that case you have true local copies and you don’t need to map the pages into the site content structure, avoiding the caching and invalidation problem completely.

The second question is whether it makes sense to offload all this shared content into a dedicated „shared content“ domain, which is used by all sites; in that case the need to duplicate is avoided as well.

These are 2 suggestions to avoid some of the problems of the „shared content“ approach. If you cannot use them, you have to go the way of duplicate content at dispatcher level, with all the implications it has, mainly:

  • potential SEO problems because of duplicate content
  • increased disk consumption on dispatcher level

To deal with the problem of duplicate content and invalidation you have to create a custom invalidation logic, which is aware of your special setup and does the invalidation accordingly. See the documentation on the dispatcher regarding this topic.
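The core of such a custom logic is a translation step: for every invalidated path below the shared content, compute the additional paths at which the dispatcher has cached a copy. A minimal sketch in plain Java could look like this; the class name, the method and all paths are made up for illustration, and how the resulting paths are actually sent to the dispatcher depends on your setup and is left out here.

```java
import java.util.ArrayList;
import java.util.List;

class SharedContentInvalidation {
    // activatedPath: the path which was invalidated at its "original" location
    // sharedRoot:    root of the shared content (hypothetical, e.g. /content/shared)
    // mountPoints:   locations in the site structures where the shared root is mapped
    static List<String> pathsToInvalidate(String activatedPath, String sharedRoot,
                                          List<String> mountPoints) {
        List<String> result = new ArrayList<>();
        result.add(activatedPath);   // the original path is always invalidated
        boolean isShared = activatedPath.equals(sharedRoot)
                || activatedPath.startsWith(sharedRoot + "/");
        if (isShared) {
            // Part of the path below the shared root, e.g. "/legal/imprint"
            String relative = activatedPath.substring(sharedRoot.length());
            for (String mountPoint : mountPoints) {
                result.add(mountPoint + relative);
            }
        }
        return result;
    }
}
```

For example, with the shared root /content/shared mapped to /content/site-a/en/shared, an invalidation of /content/shared/legal/imprint would also yield /content/site-a/en/shared/legal/imprint, while paths outside the shared root are returned unchanged.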

1000 nodes per folder and Oak orderable nodes

Every now and then there’s the question of how many child nodes are supported in JCR. While the technically correct answer is „there is no limit“, in practice there are some limitations.

In CRX 2.x the nodes are always ordered: even unordered nodes are treated as if they were ordered, which makes the difference nearly non-existent. [Thanks Justin for making this clear!] This means that the order needs to be maintained on all operations, including adding and removing sibling nodes. The more child nodes a node has, the more time it takes to maintain this list.

So, what about this „1000 child nodes“ limit? First of all, this number is arbitrary :-) But when you use CRXDE Lite, browsing a node with lots of child nodes gets really slow, mostly because of the time it takes the JavaScript to render it. But of course the performance of add and remove operations also degrades linearly. Also, you hardly ever have cases where you need more than 1000 child nodes.

But for the aspect of reading nodes there is no impact on performance. So it is not a problem to have 6000 nodes in /libs/wcm/core/i18n/en, because you only read the nodes, but you don’t change them.

But nevertheless this „limit“ can be cumbersome, especially if you don’t need the feature of ordered child nodes. Also, the fact that there is this limit means that you already feel the impact (at a lower level) with fewer nodes.

With Apache Oak this has changed. With Oak, nodes are not ordered unless their parent has a node type which supports ordering.

To illustrate the difference between sling:Folder and sling:OrderedFolder, I did a small test. I wrote a small benchmark which creates 5000 nodes, then adds more nodes, does random reads and deletes them afterwards. Each create or delete operation handles a single node and is followed by a save(). (Sourcecode)

Operation               sling:Folder   sling:OrderedFolder
Create 5000 nodes       6124 ms        17129 ms
Random read 500 nodes   2 ms           9 ms
Add 500 nodes           112 ms         564 ms

This small benchmark (executed on a 2014 MacBook Pro with SSD, AEM 6.0, TarMK, Oak 1.0.0) shows:

  • Adding lots of child nodes to a node is much faster when you are using a non-ordering nodetype.
  • Random read is faster as well; obviously Oak can use more efficient data structures than a list if it doesn’t need to maintain the ordering.

A factor of roughly 3 to 5 is obviously quite significant. Of course the benefit is smaller if you have fewer child nodes.

The problems of multi-tenancy: the development model

In large enterprises, AEM projects tend to attract many different interested parties, which all love to make use of the features of AEM. They want to get onboard the platform as fast as they can. And this can be a real problem when it comes to such a multi-tenancy AEM platform.

In the previous post I wrote about the governance problems with such projects and all the politics involved in it. These problems also persist in the daily business of developing and operating such platforms.

Many of these tenants already have their development partners and agencies, which they are used to working with. These partners have experience in that specific area and know the business. So it’s quite likely that the tenants continue to work with their partners in this specific project as well. And that is where the technical problems start.

Because at that point you’ll realize that you have multiple teams which rarely collaborate, or in the worst case not at all. Teams which might have different skill levels, operate in different development models and use different tooling. And each one of these teams gets its own prioritization and has its own schedule, and in most cases the amount of communication between these teams is quite low.

So now the platform owner (or the development manager on their behalf) needs to set up a development model which allows these multiple teams to feed all their results into a single platform. A model which doesn’t slow down your development agility and does not negatively impact the platform’s stability and performance. And this is quite hard.

A number of these challenges are (note: most of them are not specific to AEM at all!):

  • How can you ensure communication and collaboration between all development parties? That’s often a part which is left out (or forgotten) during time and budget estimation, and therefore the amount of time spent on it reflects this fact. But that’s the most important piece here.
  • On the other hand, how do you make sure that the overhead of communication and coordination is as low as possible? In most cases this means that each party gets its own version control system, its own maven module and its own build jobs. This allows a better separation of concerns during development and build time, but just postpones the problem. Because …
  • How do you avoid the case that multiple parties use the same names, which have to be unique? For example the same path below /apps or the same client library name? It’s hard to detect this at development time when you don’t have checks which cover multiple repositories and maven modules.
  • Somewhat related: How do you handle dependencies on the same library but with different versions? Although OSGi supports this at runtime as well, AEM isn’t really prepared for a situation in which you have a library in both version 1 and version 2. So you need to centrally manage the list of 3rd-party libraries (including version numbers) which the teams can use.
  • A huge challenge is testing. When you have managed to deploy all delivered artifacts to a single instance (and combining these artifacts into deployable content packages often imposes its own set of problems), how do you test and where do you report issues? How does the triaging process work which assigns the issues to the individual teams for fixing? This can very easily cause a culture of blaming and denying, which makes the actual bug fixing part very hard.
  • The same goes for production problems. No tenant and therefore no development team wants to get blamed for bringing down the platform because of some issue, so each problem can get very political, and teams start to argue why they are not responsible.
  • And many more…
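The naming problem from the list above is one of the few which can be checked mechanically, as soon as you have a place where the names claimed by all teams come together. A minimal sketch in plain Java follows; how the names are collected from the individual repositories and maven modules is not covered, and the class name, the team names and the example names are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class NameCollisionCheck {
    // claimedNames: team -> names which have to be unique on the platform
    // (for example paths below /apps or client library names).
    // Returns: name -> teams claiming it, only for names claimed by more than one team.
    static Map<String, Set<String>> findCollisions(Map<String, List<String>> claimedNames) {
        Map<String, Set<String>> owners = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : claimedNames.entrySet()) {
            for (String name : entry.getValue()) {
                owners.computeIfAbsent(name, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        // Keep only the names with at least two different owners.
        owners.values().removeIf(teams -> teams.size() < 2);
        return owners;
    }
}
```

Run as part of a central build, such a check turns a collision into a build failure instead of a surprise at deployment time.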

These are real world problems, which hurt productivity.

My thoughts how you can overcome (at least) some of the problems:

  • The platform owner should communicate openly with all tenants and involved development teams, and encourage them to adhere to a common development model.
  • The platform owner should provide clear rules how each team is supposed to work, how they create and share their artifacts, and also clear rules for coding and naming.
  • The platform owner should be in charge of a small team which supports all tenants and all development teams and helps to align requirements and the integration of the different codebases. This team is also responsible for all the 3rd-party library management and should have write access to the code repositories of all development teams.
  • Build and deployment is centralized as well.
  • Issue triaging is a cross-team effort.

This is all possible in a setup where the platform owner is not only a function responsible for running the platform, but is also allowed to exercise control over the deployment artifacts of the individual parties.

A sidenote: There is an architectural style called „micro services“, which seems to be gaining traction at the moment. It claims to address the „many teams working on a single platform“ problem as well. But the whole idea is based on splitting a monolithic application into single self-contained services, which does not really apply to this multi-tenancy problem, where every tenant wants to customize some aspects of the common system for itself. If you apply this approach to the multi-tenancy problem here, you end up with a multi-platform architecture, where each tenant has its own version of the platform.

What is new in Sling with AEM 6.1?

AEM 6.1 is out. Congratulations to my colleagues in the engineering departments for their hard work in the last year.

Every release of AEM goes together with changes in Sling, mostly bugfixes and smaller enhancements. Normally these changes are not mentioned directly in the release notes of AEM, so in most cases you have to look them up on your own.

For the AEM 6.1 release I want to create a small series of blog posts which point out the major changes in the packaged Sling bundles, only considering the changes available in 6.1 and not yet in 6.0 (not including hotfixes and featurepacks). I will try to cover some major changes and improvements you can use in your projects.

So let’s start with the complete list of Sling bundles (sorted alphabetically) and their versions in both 6.0 and 6.1; and for completeness I also added the versions of AEM 5.6.1. In case a bundle isn’t available in a specific version, I inserted a “-“.

Symbolic Name of the Bundle AEM 5.6.1 AEM 6.0 AEM 6.1 2.1.0 2.1.0 2.1.4 2.4.3.R1488084 2.7.0 2.9.0 0.9.0.R988585 0.9.0.R988585 0.9.0.R988585 1.1.2 1.1.7.R1584705 1.3.6 0.0.1.Rev1231138 0.0.1.R1582230 0.0.1.R1582230 2.1.2 2.2.0 2.2.0 1.3.0 1.3.2 1.3.2 2.1.0 2.1.0 2.2.0 1.0.0 1.0.0 1.0.0 1.0.0 2.0.6 2.0.6 2.0.10 3.0.0 4.0.0 4.0.2 1.0.2 1.0.2 1.0.4 2.1.4 2.1.4 2.1.8 2.2.0 2.2.0 2.2.2 2.3.4 2.4.2 2.4.6 3.1.0 3.2.0 3.2.0 1.0.0 0.1.0.R1484784 1.0.0 1.0.2 0.1.0.R1486590 1.0.8 1.1.0 0.1.0.R1484784 1.0.0 1.0.0 0.1.0 0.1.1.r1678168 2.2.8 2.3.3.R1588174 2.4.2 3.1.5.R1485539 3.3.10 3.5.5.R1667281 1.0.0 0.2.2 0.2.2 0.2.2 1.0.0 1.0.0 1.1.4 1.0.0 1.0.0 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.0.2 1.1.0 1.2.0 1.1.0 1.1.2 2.2.4 2.2.8 2.4.0 1.0.0 1.0.0 1.0.0 1.0.0 3.4.6 3.5.0 3.6.4 1.0.10 1.0.12 1.1.2 1.0.0 1.0.2 1.0.2 1.1.0 3.1.6 3.1.6 3.1.16 0.1.0 0.1.0 0.1.0 2.1.0 2.2.0 2.2.0 2.1.2 2.2.2 2.2.2 3.1.12 3.2.0 2.1.0 2.1.0 2.1.0 2.1.6 2.1.6 2.1.10 1.2.0 1.2.0 1.2.2 2.0.0 2.0.0 2.0.0 0.0.1.R1345943 1.0.0 1.0.2 2.2.9.R1483758 2.3.7.R1591843 2.5.0 0.0.1.R1562502 1.0.2 2.2.0 2.2.2 2.2.2 1.0.2 1.0.2 1.2.0 1.2.0 1.2.0 1.0.0 1.1.0 1.0.2 1.1.0 1.0.2 1.0.4 0.0.1.R1579485 1.0.0 0.0.1.R1479861 1.0.0 1.0.0 1.1.2 1.2.9.R1675563-B002 1.0.6 1.1.0 1.2.4 1.0.4 1.0.4 1.0.4 2.1.4 2.1.6 2.1.6 2.0.24 2.0.26 2.0.28 2.0.6 2.0.6 2.0.12 2.0.12 2.0.13.R1566989 2.0.16 2.0.28 2.0.28 2.1.6 2.1.8 2.2.0 2.2.4 2.0.6 2.0.6 2.0.6 1.0.2 1.0.4 1.0.4 1.0.6 1.0.10 1.0.0 1.2.0 1.0.0.Revision1200172 1.0.0.Revision1200172 1.0.0.Revision1200172 2.1.4 2.1.8 2.1.10 2.3.1.R1485589 2.3.4 2.3.6 2.2.4 2.3.2 2.3.6 1.2.2 1.3.0 1.3.6 0.0.1.Rev1387008 0.0.1.Rev1526908 0.0.1.Rev1526908 0.0.1.Rev1387008 0.0.1.Rev1387008 0.0.1.Rev1387008 1.0.0 1.0.0 1.0.2