Disabling services and components in AEM

Sometimes you need to disable a service or a component; a simple example is a servlet which is used on an authoring instance but must not be active on publish. There are several ways to achieve this. (In this blog post, whenever I mention “service”, you can assume that the same also works for SCR components; technically “component” would even be the right wording, but in the AEM world “component” is a heavily used word with a number of different meanings.)

A very simple and smart solution for your own codebase is the SCR configuration policy: when it is set to “require” on an OSGi service, the SCR runtime won’t start the service unless a dedicated OSGi configuration exists (not even the activate() method is called). And because you can create OSGi configs based on run modes, it’s the perfect way to enable or disable services.

Nice examples for this can be found in ACS AEM Commons.

This is the recommended way to write services; the decision whether to run a service or not is made at deployment/configuration time and not during development. And it’s the most complete way, because with proper configuration the service will never become active at all. The only (very) small drawback is that the number of your project-managed OSGi configurations will increase.

A different way is to use a special property „enabled“, which the service checks before doing anything useful. But when you use the enabled property, the service is still started and registered with the OSGi runtime; thus it might get registered as a servlet or picked up by other service factories. You never know exactly what happens and what doesn’t, so it’s always best to assume that the code can run.
This approach also gives you the choice at deployment time to enable the service or not. But it has the drawback that the service is active, and some of its code might run before the „enabled“ status is checked. So from my understanding there is never really a use case for this „enabled“ property. And if it has a different function than turning the service on or off, it shouldn’t be named „enabled“.

If you need to disable services which are not under your control, and which offer neither an „enabled“ property nor the configurationPolicy approach, the only remaining choice is the ComponentDisabler of ACS AEM Commons. That’s basically a hack and should be your last resort, because it cannot prevent the startup of the service; it shuts the service down after it has been started (and might already have been working). But if you can live with these constraints, it’s the way to go.

If you are a developer, I strongly recommend learning and using the SCR ConfigPolicy setting!

Integrating AEM in a portal?

Content which is produced and stored inside AEM is often widely used: not only for direct publishing to the web, but nowadays also in emails (using Adobe Campaign) or for consumption by mobile apps. Another very common case is the integration of AEM content into 3rd party systems, where content maintained in AEM is fetched by a portal. So the portal provides the transactional parts, but the content is fetched from AEM. This is the use case I want to discuss in this post. When I write „portal“ it’s most often a J2EE portal, but there are also other options. For this post the underlying technology stack doesn’t really matter.

In this case the portal is always the leading application, and AEM has just a supporting function (providing content). Depending on the exact case, you can embed AEM content as a portlet or use the REST approach to fetch AEM content. In this case AEM is mainly used for content authoring.

This approach has the unique benefit that you can continue to use the existing portal solution and provide your authors with an easy-to-use solution for content authoring. The existing architecture is just extended by adding AEM.

But from my point of view this approach also has some severe problems:

  • In the editmode AEM can only display the elements, which are available to AEM. If the portal displays AEM content as part of the page next to a number of other elements, you don’t have these elements available on AEM. This limits the usability of the editmode or preview mode to display content in an end-user way.
  • A similar problem is that with AEM pages you can normally edit most pieces of a page. In the case of the portal, you have to limit the possibilities of an author to what the portal provides. That means, if the portal does not allow you to change the page footer or add special HTML headers, you cannot change this from AEM (although it might be possible ootb). Within an AEM application I always recommend allowing authors to change every text and image of a page (including headers and footers), avoiding any hardcoded content.
  • When content is created in AEM, you need to develop templates and components for it. If you want to display this content in the portal, you need to build its equivalent there as well, but with a different technology. This doubles the cost and creates dependencies between the cycles of the AEM development and the portal application development. You need to spend development effort on both sides to include new components or to change the parts of a page where content should be managed by authors.
  • Recent versions of AEM provide a good integration of the Adobe Marketing Cloud features, so authors can easily use them. When a portal is set up in a way that it fetches content from AEM, this integration normally needs additional effort and implementation work, which you don’t have when AEM is in the front.

A personal conclusion: I think that using AEM just as a content-authoring system is possible, but you ignore many of its features, which bring a lot of value. You increase costs by having to develop new components and templates twice (for AEM and the portal), and you lengthen the time-to-market by synchronizing 2 development streams which should basically be independent. And you cannot use many of the new Adobe Marketing Cloud integrations provided out-of-the-box.

So there are quite some arguments (especially for enduser-facing systems) not to use AEM just as a simple content-feed, but to establish AEM as frontend of your platform.

TarMK and SAN

Yesterday’s posting „TarMK and NAS“ got quite some attention, and today I was asked several times „And what about SAN?“. Well, here is the answer to „Do you recommend TarMK on SAN?“

A little background first: A SAN (Storage area network) is a service, which offers block devices. It is part of today’s enterprise datacenter’s infrastructure, where attaching local disks to servers is not feasible and does not scale. If you are familiar with PCs, you can consider a block device like a partition on your hard disk. You cannot use a partition by itself, but you have to format it and put a filesystem on it.
That’s basically the same with a SAN: You get it as a raw device (called a volume), you put a filesystem on it (for example ext4 when you use Linux) and then you can use it just like any other local drive. The only difference to a local drive is that the connectivity is not provided by a local SATA port, but over a network (you’ll find the terms iSCSI or Fibre Channel, but that’s too much detail here).

And that’s the huge difference: With a SAN you get a block device, with a NAS you’ll get a shared filesystem.

So the basic principle is, that you can treat SAN like any local storage. And if you format it with a filesystem like ext4, btrfs or NTFS on Windows, you have a local filesystem. And like I said in yesterday’s post: When you have a local filesystem, where only a single system is controlling access to it, you can use mmap. And mmap is all we care about here!

My recommendations for TarMK are:

  • When you have the choice between SAN and NAS (sometimes you have): drop the NAS and go for SAN.
  • And when you have the choice between SAN and local drives, choose „SAN“ as well. Why? Because you never need to deal with the problem of „my hard drives are full and we don’t have any empty drive bays left on this server!“ anymore. Just allocate some more space to your SAN volume, resize the filesystem and that’s it. When you have mmap available for your TarMK, filesystem performance shouldn’t be something to worry about.

TarMK on NAS?

Today the question was raised, if TarMK running on NAS is a good idea. The short answer is: „No, it’s not a good idea“.

The long answer: The TarMK relies on the ability of the operating system to map files into memory (so-called memory-mapped files, mmap for short). Oak does this for the heavily used parts of the TarMK to increase performance, because then these parts don’t need to be read from the filesystem again, but are always available in memory (which is at least an order of magnitude faster). This works well with a local filesystem, where the operating system knows about every change happening on the filesystem, because all access to this filesystem goes through it, and it can make sure that the content of the file on disk and in memory are in sync. I should also mention that this memory isn’t part of the heap of the JVM; the free RAM of the system is used for this purpose.
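To make the mmap idea a bit more concrete, here is a minimal Java sketch which maps a file into memory and reads it through the mapping. This is not Oak’s code; the class name MmapSketch and the temporary file are just assumptions for illustration. It only shows the plain JDK mechanism (FileChannel.map) which a Java application can use to let the operating system keep file content in memory outside of the heap.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {

    // Maps the whole file read-only and decodes its content as UTF-8.
    static String readViaMmap(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapped region is not part of the JVM heap; the operating
            // system pages the file content in and out as needed.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Copying into a heap array is only done here to build a String.
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Writes the text to a temporary file (hypothetical stand-in for a
    // TarMK file) and reads it back through the memory mapping.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return readViaMmap(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello tarmk")); // prints: hello tarmk
    }
}
```

The important part is the comment in readViaMmap: the mapped bytes live in the RAM managed by the operating system, which is why the JVM heap size does not tell you how much memory a mapping uses.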

With a NAS the situation is different. A NAS is designed to be accessed by multiple systems in parallel without the need to synchronize between each other. The 2 most common filesystems for this are NFS and SMB/CIFS. On NFS one system can open a file and is not aware that a second system modifies it at the same time. This is a design decision which prevents a system from keeping the content of a file on NFS and in memory in sync. Thus mmap is not usable when you use a NAS to store your TarMK files.

And because mmap is not usable, you’ll get a huge performance impact compared to a local filesystem where mmap can be used. And then I haven’t even mentioned the limited bandwidth and higher latency of a remote storage compared to local storage.

If you are migrating from CRX 2.x (up to AEM 5.6.1): there this problem was not as visible as it is now with Oak, because CRX 2.x had the BundleCache, which cached data already read from disk; this bundle cache is an in-memory, in-heap structure, and you had to adjust the heap size for it. CRX 2.x did not use mmap.

Oak does not have this in-memory cache any more; it relies on the mmap() feature of the operating system to keep the often-accessed parts of the filesystem (the TarMK) in memory. And that’s the reason why you should leverage mmap as much as possible and therefore avoid a NAS for TarMK.

Resource path vs URL and rewriting links

Today I want to discuss an aspect of an AEM application which is rarely considered during application development, but which normally gets very important right before a go-live: the path element of a URL, and how it is constructed (either in its full version or in a shortened one).

Newcomers to the AEM world sometimes ask how the public URLs are determined and maintained; from their experience with older or other CMS systems pages have an ID and this ID has to be mapped somehow to a URL.
Within AEM this situation is different, because the author creates a page directly in the tree structure of a site. And the name of the page can be directly mapped to a URL. So if an author creates a page /content/mysite/en/news/happy-new-year-2016, this page can be reached via https://HOST/content/mysite/en/news/happy-new-year-2016.html (in the simplest form).

From a technical point of view, the resource path is mapped to the path element of a URL. In many cases this is a 1:1 mapping (that means that the full resource path is taken as the path of the URL). Often the „many“ means „in development environments“, because in production environments these kinds of URLs are long and contain redundant information, which is something you should avoid. A URL also contains a domain part, and this domain part often carries information which then isn’t needed in the path anymore.
So instead of „https://mysite.com/content/mysite/en/news.html“ we rather prefer „https://mysite.com/en/news.html“ and map only a subset of the resource path.

When mapping the resource path to a URL you must be careful, because the other way (the mapping of URL to resource path) has to work as well, and there must be exactly 1 mapping.

These kinds of mappings (I often call the mapping „resource path to URL path“ a forward mapping and the mapping „URL path to resource path“ a reverse mapping) can be created using the /etc/map mechanisms. In a web application you need to use both mappings:

  1. When the request is received, the URL path has to get mapped to a resource, so the Sling resource processing can start.
  2. When the rendered page contains links to other pages, the resource paths of these pages have to be provided as URL paths.

(1) is done automatically by Sling if the correct ruleset is provided. (2) is much more problematic, because all references to resources provided by AEM have to be rewritten. All references? Generally speaking, yes; I will discuss this later on.
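The two directions and the requirement that they fit together can be sketched in a few lines of plain Java. This is a toy model and not the Sling implementation: the class PathMappingSketch, the content root /content/mysite and the list of unmapped roots are assumptions chosen for illustration; in a real setup the rules live in /etc/map.

```java
import java.util.Arrays;
import java.util.List;

final class PathMappingSketch {

    // Hypothetical content root which should not show up in public URLs.
    private static final String CONTENT_ROOT = "/content/mysite";

    // Hypothetical roots which are always delivered with a 1:1 mapping.
    private static final List<String> UNMAPPED_ROOTS =
            Arrays.asList("/content/", "/etc/", "/libs/", "/apps/");

    // Forward mapping: resource path -> URL path.
    static String toUrlPath(String resourcePath) {
        if (resourcePath.startsWith(CONTENT_ROOT + "/")) {
            return resourcePath.substring(CONTENT_ROOT.length());
        }
        return resourcePath; // everything else keeps the 1:1 mapping
    }

    // Reverse mapping: URL path -> resource path.
    static String toResourcePath(String urlPath) {
        for (String root : UNMAPPED_ROOTS) {
            if (urlPath.startsWith(root)) {
                return urlPath; // already a full resource path
            }
        }
        return CONTENT_ROOT + urlPath;
    }

    public static void main(String[] args) {
        String url = toUrlPath("/content/mysite/en/news.html");
        System.out.println(url);                 // prints: /en/news.html
        System.out.println(toResourcePath(url)); // prints: /content/mysite/en/news.html
    }
}
```

Even this small example shows why you have to be careful: a page stored at /content/mysite/etc/foo would be shortened to /etc/foo by the forward mapping, and the reverse mapping would then no longer find its way back. Both directions have to be designed together.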

This mapping can be done through 2 API methods of the resource resolver: ResourceResolver.map() for the forward mapping and ResourceResolver.resolve() for the reverse mapping.

You might wonder why you never use these 2 methods in your own code, even though I wrote above that all the links to other pages need to be rewritten. Basically you don’t have to do this, because the HTML created by the rendering pipeline (including all filters) is streamed through the Sling Output Rewriting Pipeline. This chain contains a rewriter rule which scans through all the HTML and tries to apply a forward mapping to all links.
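To illustrate what „scan the HTML and apply a forward mapping to all links“ means, here is a deliberately simplified Java sketch. It is not how the Sling rewriter is implemented (a regular expression is a poor substitute for a real HTML parser); the class LinkRewriteSketch and the choice of the href and src attributes are assumptions for this example only. The forward mapping is passed in as a function.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkRewriteSketch {

    // Matches href="..." and src="..." attributes with double-quoted values.
    private static final Pattern LINK = Pattern.compile("(href|src)=\"([^\"]*)\"");

    // Applies the given forward mapping to every matched attribute value.
    static String rewriteLinks(String html, UnaryOperator<String> forwardMapping) {
        Matcher matcher = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String mapped = forwardMapping.apply(matcher.group(2));
            String attribute = matcher.group(1) + "=\"" + mapped + "\"";
            // quoteReplacement keeps '$' and '\' in paths from being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(attribute));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical forward mapping: strip the content root of the site.
        UnaryOperator<String> mapping = path ->
                path.startsWith("/content/mysite/")
                        ? path.substring("/content/mysite".length())
                        : path;
        String html = "<a href=\"/content/mysite/en/news.html\">News</a>";
        System.out.println(rewriteLinks(html, mapping));
        // prints: <a href="/en/news.html">News</a>
    }
}
```

Note that only markup attributes are touched. A path inside a CSS rule or a Javascript string is just text for such a rewriter, and it passes through unchanged.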

But this rewriter only runs on HTML output, and there are other elements of a site which contain references to content stored in AEM as well, for example Javascript or CSS files. References contained in these files are not rewritten, but delivered as they are stored in the repository. In many cases the setup is designed in a way that a 1:1 mapping still works; but that’s not always possible (or wanted).

So please take this as advice: Do not hardcode a path in CSS or Javascript files if there’s a chance that these paths need to be mapped.
(Rewriting formats other than HTML is not part of AEM itself; of course you can extend the defaults and provide a rewriting capability for Javascript and CSS as well, but that’s not an easy task.)

The question is whether you really have to rewrite all resource paths at all. In many cases it is ok just to have the URLs of the HTML pages looking nice (because these are the only URLs which are displayed prominently). All the other resources (e.g. assets, CSS and Javascript files) don’t need to get mapped at all; for them the default 1:1 mapping can be used. Then you’re fine, because you only have to do the mapping once in /etc/map and that’s it.

The Apache mod_rewrite module also offers very flexible ways to do reverse mapping, but it lacks a way to apply a forward mapping to the HTML pages (as the Sling Output Rewriter does). So mod_rewrite is a cool tool, but it is not sufficient to completely cover all aspects of resource mapping.

How can I avoid Oak write/merge conflicts?

Sandeep asked in a comment to the previous posting:

Even if your sessions are short, and you have made a call to session.refresh(true), it is possible that someone made a change before you did a session.save(), right? So, what is the best practice in dealing with such a scenario?
Keep refreshing (session.refresh(true)) in a loop, until your session.save() is successful or until you hit an assumed maximum number of attempts limit?
Or is there any other recommended best practice?

That’s a good question, but also a question with no satisfying answer. Whenever you want to modify nodes in the repository, there’s a chance that the same nodes are changed in parallel, even if you change only a single node or property. In reality, this rarely happens. Most features in AEM are built in a way that each step (or each workflow instance, each Sling job, each replication event, etc) has its own nodes to operate upon. So concurrency normally does not lead to a situation where multiple concurrent operations compete for writes on a single node.
So from a coding perspective it should be possible to avoid such situations. Not only because of this kind of problem, but also for the sake of performance and debugging.

Something you cannot deal with in this way are author changes. If 2 authors decide to change a page at the same time, it’s likely that they screw up the page. You can hardly avoid that just using code. But if you cannot guarantee from a work organization point of view that no 2 persons work on the same page at the same time, teach your authors to use the „lock“ feature. It basically prevents other authors from making changes temporarily. But according to the Oak documentation it isn’t suited for short-lived locks (in a database sense), but rather for longer-lived locks (an author locks a page to prevent other authors from editing it).

So, to come to a conclusion on Sandeep’s question: It depends. If you designed your application carefully, you should rarely get into a situation where multiple threads compete for a node. But whenever it occurs it should be considered a bug, analyzed and then fixed.
But there can be other cases, where this approach could make sense. In any case I would retry a few times (e.g. 10) and then break the operation with a meaningful log message. But I don’t think that it’s good to retry indefinitely.
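Such a bounded retry could look roughly like the following sketch. It is written against plain Java only: the class BoundedRetrySketch and its retry() method are made up for this post, and the operation is a generic Supplier instead of real JCR calls. In real code the operation would refresh the session, apply the changes and call session.save(), and you would catch only the exception which signals the conflict.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

final class BoundedRetrySketch {

    private static final Logger LOG =
            Logger.getLogger(BoundedRetrySketch.class.getName());

    // Runs the operation up to maxAttempts times and gives up afterwards.
    static <T> T retry(int maxAttempts, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                // Assumption: every RuntimeException counts as a conflict here.
                lastFailure = e;
                LOG.warning("Attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
            }
        }
        // No endless loop: stop with a meaningful message and keep the cause.
        throw new IllegalStateException(
                "Giving up after " + maxAttempts + " attempts", lastFailure);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = retry(10, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated conflict");
            }
            return "saved";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: saved after 3 attempts
    }
}
```

The fixed upper limit is the whole point: if the operation still fails after a handful of attempts, that is a hint for a design problem, and the log message should make it easy to find.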

AEM anti pattern: Long running sessions

AEM 6.x comes with Apache Jackrabbit Oak, which is built on the MVCC principle. MVCC (multi-version concurrency control) gives every session a view on a certain state of the repository. This state does not change; it can be considered immutable. If you want to perform a change on the repository, the change is performed against this state and then applied (merged) to the HEAD state (which is the most current state within the repository). This merge is normally not a problem if the state of the session doesn’t differ too much from the HEAD state; in case the merge fails, you get an OakMerge exception.
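The following toy model in plain Java may help to picture this. It has nothing to do with Oak’s real data structures: the class MvccSketch, its Session and the per-path version counters are assumptions which I made up to show the principle. Each session works on a private snapshot, and a save only succeeds if nobody else has changed the same paths since the snapshot was taken.

```java
import java.util.HashMap;
import java.util.Map;

// Toy repository, not thread-safe; for illustration only.
final class MvccSketch {

    private final Map<String, String> headValues = new HashMap<>();
    private final Map<String, Integer> headVersions = new HashMap<>();

    Session login() {
        return new Session();
    }

    final class Session {
        private Map<String, String> values;     // snapshot of the content
        private Map<String, Integer> versions;  // snapshot of the versions
        private final Map<String, String> pending = new HashMap<>();

        Session() {
            takeSnapshot();
        }

        private void takeSnapshot() {
            values = new HashMap<>(headValues);
            versions = new HashMap<>(headVersions);
        }

        String get(String path) {
            return pending.containsKey(path) ? pending.get(path) : values.get(path);
        }

        void set(String path, String value) {
            pending.put(path, value);
        }

        // Merges the pending changes into HEAD or fails on a conflict.
        void save() {
            for (String path : pending.keySet()) {
                int seen = versions.getOrDefault(path, 0);
                int current = headVersions.getOrDefault(path, 0);
                if (seen != current) {
                    // The pending changes are deliberately kept, not cleared.
                    throw new IllegalStateException("Unresolved conflicts in " + path);
                }
            }
            for (Map.Entry<String, String> change : pending.entrySet()) {
                headValues.put(change.getKey(), change.getValue());
                headVersions.merge(change.getKey(), 1, Integer::sum);
            }
            pending.clear();
            takeSnapshot();
        }
    }
}
```

In this model two sessions which write to different paths never disturb each other, while the second writer to the same path gets the exception. The longer a session lives without taking a new snapshot, the more likely it is that its snapshot is outdated for the path it wants to write.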

Note that this is a change compared to Jackrabbit 2.x and CRX 2.x, where the state of a session was always up to date, and where these merge exceptions never happened. This also means that you might need to change your code to make it work well with Oak!

If you have long-running sessions, the probability of such OakMerge exceptions gets higher and higher. This is due to other changes happening in the repository, which could also affect the areas where your session wants to perform its changes. This is a problem especially in cases where you run a service which opens a session in the activate() method, closes it in deactivate() and uses it to save data to the repository in between. These are rare cases (because they have been discouraged for years), but they still exist.

The problem is that if a save() operation fails due to such an OakMerge exception, the temporary space of that session is polluted. The temporary space of a session is heap memory, where all the changes are stored which are about to get saved. A successful session.save() cleans that space afterwards, but if an exception happens this space is not cleaned. And if a session.save() fails because of such an OakMerge exception, any subsequent session.save() on that session will fail as well.

Such an exception could look like this (relevant parts only):

Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /content/geometrixx/en/services/jcr:content
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:237)
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:212)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.newRepositoryException(SessionDelegate.java:672)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.save(SessionDelegate.java:539)
at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.save(ItemDelegate.java:141)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(ItemImpl.java:262)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(ItemImpl.java:259)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:294)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:113)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.save(ItemImpl.java:259)
at org.apache.jackrabbit.oak.jcr.session.NodeImpl.save(NodeImpl.java:99)

There are 2 ways to mitigate this problem:

  • Avoid long-running sessions and replace them by a number of short-lived sessions. This is the way to go and in most cases the easiest solution to implement. This also avoids the problems coming with shared sessions.
  • Add code to call session.refresh(true) before you do your changes. This refreshes the session state to the HEAD state, so exceptions are less likely. If you run into a RepositoryException you should explicitly clean up your transient space using session.refresh(false); then you’ll lose your changes, but the next session.save() will not fail because of the leftovers. This is the path you should choose when you cannot create new sessions.
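The second option can be sketched like this. To keep the example free of dependencies I use a small stand-in interface instead of javax.jcr.Session: SessionLike and saveSafely() are names invented for this post, and the stand-in throws unchecked exceptions, while the real refresh() and save() declare a RepositoryException.

```java
final class RefreshSketch {

    // Stand-in for the two javax.jcr.Session methods used in this sketch.
    interface SessionLike {
        void refresh(boolean keepChanges);
        void save();
    }

    // Returns true if the changes were saved, false if they were discarded.
    static boolean saveSafely(SessionLike session, Runnable changes) {
        // Bring the session to the HEAD state before touching anything.
        session.refresh(true);
        try {
            changes.run();
            session.save();
            return true;
        } catch (RuntimeException e) {
            // Throw away the transient changes, so that the next save()
            // on this session does not fail because of the leftovers.
            session.refresh(false);
            return false;
        }
    }

    public static void main(String[] args) {
        SessionLike alwaysFine = new SessionLike() {
            @Override
            public void refresh(boolean keepChanges) {
                System.out.println("refresh(" + keepChanges + ")");
            }

            @Override
            public void save() {
                System.out.println("save()");
            }
        };
        boolean saved = saveSafely(alwaysFine, () -> System.out.println("changing nodes"));
        System.out.println("saved: " + saved);
    }
}
```

Whether you just return false, log the problem or retry a few times depends on your use case; the important part is that the session is clean again after a failed save.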