AEM coding best practice: Servlets

You might say that servlets are old technology; so old that every Java web developer should know everything about them.

Yes, servlets have existed since the 1990s (to be exact: 1997), and the basics haven’t really changed. So what’s so special about servlets that I decided to write a dedicated blog post about them and title it “AEM coding best practice”?

Well, there’s nothing special in terms of coding. Everything that has been recommended since 1997 can still be considered valid. But there is a subtle difference between servlet development for AEM and servlet development for other types of applications: AEM (or rather: Sling) is resource-oriented.

This aspect makes it hard for developers who normally bind servlets to hardcoded paths (either via annotations or via the web.xml bindings). Binding servlets to a path is still possible in Sling, but it is actually an anti-pattern, because such a servlet is not bound to a real, existing resource, and therefore a number of Sling’s goodies are not applicable.

Instead I recommend binding servlets to resource types. The first and probably most obvious reason is that you do not need to hardcode any path in your code (or config); instead you create a resource with that resource type at the path where you would like it to be, and the servlet can then be called via this path. The second benefit is that you can apply access control on the JCR nodes backing the respective resource. If you don’t have read access to that resource, you cannot call the servlet. That is a great way to restrict access to certain functions to a number of users without implementing access control in your own code, just by using the out-of-the-box features of the JCR repository.

This “bind to a resource type” approach should remind you of the way resources and their components are wired. A resource has the property “resource type”, which denotes the component used to render this resource. With a servlet you specify the resource type your servlet wants to handle. So it’s basically the same, and instead of JSPs or Sightly scripts you can also use servlets to implement components. You can also easily implement the handling of selectors or different extensions in servlets.
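To make the idea more tangible, here is a minimal sketch in plain Java. It is a toy model and deliberately not the Sling API: the class ResourceTypeDispatchSketch, its methods and the resource type "myapp/components/export" are all made up for illustration. It only shows the two effects described above: the handler knows nothing about paths, and a user without read access on the resource never reaches the handler.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Toy model of resource-type based dispatch. None of these types are Sling classes.
class ResourceTypeDispatchSketch {

    // A piece of content: where it lives, what type it is, and who may read it.
    static final class Resource {
        final String path;
        final String resourceType;
        final Set<String> readers;

        Resource(String path, String resourceType, Set<String> readers) {
            this.path = path;
            this.resourceType = resourceType;
            this.readers = readers;
        }
    }

    private final Map<String, Resource> contentByPath = new HashMap<>();
    private final Map<String, Function<Resource, String>> handlersByType = new HashMap<>();

    // Content decides where a resource type is reachable; the code does not.
    void addResource(Resource resource) {
        contentByPath.put(resource.path, resource);
    }

    // The handler ("servlet") is bound to a resource type, never to a path.
    void bind(String resourceType, Function<Resource, String> handler) {
        handlersByType.put(resourceType, handler);
    }

    // Resolve the path for the given user, then dispatch on the resource type.
    // Empty result: no such resource, no read access, or no handler for the type.
    Optional<String> handle(String user, String path) {
        Resource resource = contentByPath.get(path);
        if (resource == null || !resource.readers.contains(user)) {
            return Optional.empty();
        }
        Function<Resource, String> handler = handlersByType.get(resource.resourceType);
        return handler == null ? Optional.empty() : Optional.of(handler.apply(resource));
    }
}
```

In this model, making the function available under a different URL means adding or moving a resource of that type; the handler code stays untouched.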

I do not recommend dropping JSPs and Sightly altogether and switching to servlets unless your frontend developers speak Java fluently, now and for the next years. Sightly has been developed for exactly this purpose: frontend work should be handled by frontend developers and must not require Java development know-how. Use Sightly whenever possible.

And finally a bookmark for everyone working with Sling: The Sling servlets and scripts documentation.

AEM scaling patterns: Avoid shared sessions

The biggest change in AEM 6.0 compared to its prior versions is the use of Apache Oak as the repository implementation instead of Apache Jackrabbit 2.x. Although both implement the JCR 2.0 API (Oak not completely yet, but the “important” parts are there), there are a number of differences between them.

In the area of scalability the most notable change is the use of MVCC (multi-version concurrency control, a proven approach taken from the relational database world) in Oak. It decouples sessions from the global repository state and is the basis for the scalability of the repository. But it comes at a price: sessions should be used only by a single thread. It is only a “should”, because Oak detects when multiple threads access a single session and then serializes the access to it.

(For the record: The same recommendation already applied to Apache Jackrabbit 2.x, but the impact was never that high, mostly because it wasn’t as scalable as Oak is now.)

This isn’t a real limitation, but it requires careful design of any application. In the context of AEM it normally isn’t a problem at all, because every incoming HTTP request uses a dedicated session of its own. While this is true for request processing, there is often functionality which doesn’t follow this pattern.

I put an example of such a case on GitHub, including a recommended implementation and a discouraged implementation. The problem in the discouraged example lies in the fact that the repository session (in the example hidden behind the resource resolver abstraction) is opened once at the startup of the service by the thread which activates all services. But then resources are handed out to every other thread calling the getConfiguration() method. If every request makes this call, they all get synchronized here, which limits scalability.

In the recommended example this problem is avoided: each call to getConfiguration() opens a new session, reads the required resource and then closes the session. Here the session and its data are held completely inside one thread, and there’s no need for synchronization anymore.
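The shape of the recommended variant can be sketched in a few lines of plain Java. This is not the code from the GitHub example and not the Sling or JCR API: Repository, Session and ConfigurationService are stand-in types I made up so that the sketch runs on its own, and the open-session counter exists only to make the behaviour visible. Only the method name getConfiguration() is taken from the description above.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of "one short-lived session per call". All types are stand-ins.
class SessionPerCallSketch {

    // Stand-in for a repository that hands out sessions.
    static final class Repository {
        private final Map<String, String> data;
        final AtomicInteger openSessions = new AtomicInteger();

        Repository(Map<String, String> data) {
            this.data = data;
        }

        Session login() {
            openSessions.incrementAndGet();
            return new Session(this);
        }
    }

    // Stand-in for a session; whoever opens it is responsible for closing it.
    static final class Session implements AutoCloseable {
        private final Repository repository;
        private boolean closed;

        Session(Repository repository) {
            this.repository = repository;
        }

        String read(String path) {
            if (closed) {
                throw new IllegalStateException("session already closed");
            }
            return repository.data.get(path);
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                repository.openSessions.decrementAndGet();
            }
        }
    }

    // Service in the style of the recommended example: no session is kept in a field.
    static final class ConfigurationService {
        private final Repository repository;

        ConfigurationService(Repository repository) {
            this.repository = repository;
        }

        // Opens a session, copies the needed value out, and closes the session
        // again, all within the calling thread.
        String getConfiguration(String path) {
            try (Session session = repository.login()) {
                return session.read(path);
            }
        }
    }
}
```

Note that the method returns a plain value, not an object that is still tied to the session. With real resources the same idea applies: take what you need while the session is open, then let it go.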

That’s the theory, but how can you easily detect whether you have this problem as well? The easiest way is to set the log level for the class org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate to DEBUG. Every time Oak detects that a session is used by multiple threads, it prints a stack trace to the log. If this happens on write access, it uses the WARN level; in case of reads, the DEBUG level.

23.02.2015 09:21:56.916 *WARN* [0:0:0:0:0:0:0:1 [1424679716845] GET /content/geometrixx/en/services.html HTTP/1.0] org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate Attempt to perform hasProperty while another thread is concurrently reading from session-494. Blocking until the other thread is finished using this session. Please review your code to avoid concurrent use of a session.
java.lang.Exception: Stack trace of concurrent access to session-494
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:276)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:113)
at org.apache.jackrabbit.oak.jcr.session.NodeImpl.hasProperty(NodeImpl.java:812)
at org.apache.sling.jcr.resource.JcrPropertyMap.read(JcrPropertyMap.java:350)
...

If you want to have a scalable AEM application, you should carefully watch out for these log messages and optimize the use of shared sessions.

Meta: I am at Summit 2015 in Salt Lake City

It’s always hard as a techie to get to a customer conference (especially to a high-profile one); even as an Adobe employee it is hard to justify going to the Adobe Summit. But with the support of some co-workers I made it. I am very proud to present at the Adobe Summit this year in Salt Lake City.

I will have a (hands-on) lab session called “Silver bullets for Adobe Experience Manager success”. It will cover how you can use existing and well-known features of the AEM technology stack to get the most out of AEM. If you are an AEM expert I won’t tell you anything new, but I may give you some inspiration and ideas. But don’t be too late with your registration; the 3 slots I have are filling quickly.

I am really looking forward to it and I hope to meet many of my readers there. Just drop me a note if you want to meet with me in person.

Thanks,
Jörg

AEM anti-pattern: The hardcoded content structure

One of the first things I usually do when we start an AEM project is to get a clear vision of the content and its structure. We normally draw a number of graphs, discuss a number of use cases, and in the end we come up with a content structure which satisfies the requirements. Then we implement this structure as a hierarchy of nodes and that’s it.

In many cases developers start to use this structure without too much thinking. They assume that the node structure is always like this. They even start to hardcode paths and language names or mimic this structure. Sometimes that’s not a problem. But it gets hard when you are building a multi-language or multi-tenant site and you start simple with only 1 language and 1 tenant; then you might end up with these languages or tenants being hardcoded, as “there was no time to make it right”. Imagine starting with the second language or the second site after someone has hardcoded a language or a site name/path.

So, what can you do to avoid hardcoded paths? Some information is always stored at certain areas. For example you can store basic contact information on the root node, which you can reuse on the whole site. So how do you identify the correct root node if you have multiple sites? Or how do you identify the language of the site?

The easiest way is to mark these site root pages (I prefer pages over nodes here, as they can be created using the authoring UI and are much easier to author) with a certain property and value. It is easiest if you have a special template with its dedicated resource type. Then you can identify these root pages using 2 approaches:

  • When you need to find them all, use a JCR query and look for all pages with this specific resource type.
  • When you need to find the siteroot page for a given page (or resource), just iterate up the hierarchy until you find a page with this resource type.

This mechanism allows you to be very flexible in terms of the content hierarchy. You no longer depend on pages being on a certain level or having special names. It’s all dynamic and you don’t have any dependency on the content structure. This page doesn’t even have to be the root page of the public-facing site; it can be just a configuration page used for administration and configuration purposes. The real root page can be a child or grandchild of it. You have lots of choices then.
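The second approach, iterating up the hierarchy, is short enough to sketch. The following plain-Java example uses a made-up Page class instead of the real AEM page API, and the resource type "myapp/components/siteroot" is just a placeholder for whatever your siteroot template uses. The important part is what the loop does not do: it never looks at the path string or at the depth of the page.

```java
import java.util.Optional;

// Toy model of "walk up until you find the siteroot page". Page is a stand-in type.
class SiteRootSketch {

    // Placeholder for the resource type of your siteroot template.
    static final String SITE_ROOT_TYPE = "myapp/components/siteroot";

    static final class Page {
        final String name;
        final String resourceType;
        final Page parent; // null for the top-most page

        Page(String name, String resourceType, Page parent) {
            this.name = name;
            this.resourceType = resourceType;
            this.parent = parent;
        }
    }

    // Starts at the given page and moves to the parent until a page with the
    // siteroot resource type is found. No path parsing, no fixed levels.
    static Optional<Page> findSiteRoot(Page start) {
        for (Page page = start; page != null; page = page.parent) {
            if (SITE_ROOT_TYPE.equals(page.resourceType)) {
                return Optional.of(page);
            }
        }
        return Optional.empty();
    }
}
```

Whether the siteroot page sits on level 2 or level 5 makes no difference to this lookup, and a page outside of any site simply yields an empty result.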

But wait, there is a single limitation: Every site must have a siteroot page using this special template/resource type. But that isn’t a hard restriction, is it?

And remember: Never do string operations on a content path to determine something, neither the language nor the site name. Never.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says this question appeared at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of what these scenarios could look like, I drew the 1:1 and the n:m scenario.

[Figure: publish-dispatcher connections, 1:1 setup]

The 1:1 setup, where each dispatcher is connected to exactly 1 publish instance; for the invalidation every publish is also connected only with its assigned dispatcher instance.

[Figure: publish-dispatcher connections, n:m setup]

The n:m setup, where n dispatchers connect to m publish instances (for illustration here with n=3 and m=3); each dispatcher is connected via a load balancer to each publish instance, but each publish instance needs to invalidate all dispatcher caches.

I want to give you my personal opinion and answer to it. You might get other answers, both from Adobe consultants and from other specialists outside of Adobe. They all offer valuable insights into the question of how it’s done best in your case. Because it’s your case which matters.

My general answer to this question is: Use a 1:1 connection, for these reasons:

  • it’s easy to debug
  • it’s easy to monitor
  • it does not require any additional hardware or configuration

From a high-availability point of view this approach seems to have a huge drawback: When either the dispatcher or the publish instance fails, the other part is not available either.

Before we discuss this, let me state some facts which I consider the foundation of all my arguments here:

  • The dispatcher and the web server (I can only speak for Apache HTTPD and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I’ve set up and operated a good number of web environments and I’ve never seen a crashing web server or a crashing dispatcher module. As long as no one stops the process, this beast keeps handling requests.
  • A web server (and the dispatcher) is capable of delivering thousands of requests per second, if the files originate from the local disks and just need to be delivered. That’s at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture, it’s always the publish application layer. Which is exactly the reason why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) is able to deliver static files with a bandwidth of more than 500 Mbit per second (in a mixed-file scenario). So in most cases you reach the limit of your internet connection before you reach the limit of your web servers. Please note that this number is just a rough guess (depending on many other factors).

Based on these assumptions, let’s consider these scenarios in a 1:1 setup:

  • When the publish instance fails, the dispatcher instance isn’t fully operational anymore, as it can no longer reach its renderer instance; so it’s best to take it out of the load balancing pool.
    So does this have any effect on the performance capabilities of your architecture? Of course it has: it reduces your ability to deliver static files from the dispatcher cache. We could avoid that if we had the dispatcher connected to other publish instances as well. But as stated above, the delivery performance of static files isn’t a bottleneck at all, so when we take out 1 web server you don’t see any effect.
  • A web server/dispatcher fails, and the connected publish instance is not reachable anymore, effectively reducing the capacity of your bottleneck even more.
    Admittedly, that’s true; but as stated above, I’ve rarely seen a crashed web server, so this case mostly occurs with hardware problems or massive misconfigurations.

So you have a measurable impact only when the web server hardware goes down; in all other cases it’s not a problem for performance.

This is a small drawback, but from my point of view the benefits stated above outweigh it by far.

This is my standard answer when no more specific information is available. It’s a good rule of thumb. But if you have more specific requirements, it might make sense to change the 1:1 rule to a different one.

For example:

  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 web servers/dispatchers as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so keeping n copies of the same file gets expensive in terms of disk space.

If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:

  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What’s the effort to add or remove a publish instance? What needs to be changed?

Before you spend a lot of time and effort on building a complex dispatcher scenario, please consider whether a CDN isn’t a more appropriate solution to your problem…

Writing health checks — the problem

I started my professional career in IT operations at a large automotive company, where I supported the company’s brand websites. There I learned the importance of a good monitoring system, which supports IT operations in detecting problems early and accurately. And I also learned that even enterprise IT monitoring systems are best fed with a dead-simple HTML page containing the string “OK”. Or some other string, which then means “Oh, something’s wrong!”.

In this post I want to give you an impression of the problems of application monitoring, especially with Sling health checks in mind. It isn’t as simple as it sounds at first, and you can do things wrong. But every application should possess the ability to deliver some information about its current status, just as the cockpit in your car gives you information about the available gas (or electricity).

The problem of async error reporting

The health check is executed when the reports are requested, so you cannot just push your error information to the health check at the moment you log it during processing. Instead you have to write it to a queue (or any other data structure), where this information is stored until it is consumed.
The situation is different for periodic daily jobs, where only 1 result is produced every day.

Consolidating information

When you have many individual data points, but you need to build a single data point for a certain timeframe (say 10 seconds), you need to come up with a strategy to consolidate them. A common approach is to collect all individual results (e.g. just a single “ok” or “not ok”) and add them to a list. When the health check status needs to be calculated, this list is iterated, the number of “ok” and “not ok” results is counted, and the ratio is calculated and reported; after that the list is cleared and the process starts again.
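Here is a minimal sketch of this collect-count-clear approach in plain Java. It does not use the Sling health check API; the class ResultConsolidatorSketch, the Status enum and both thresholds are my own illustrative choices, and treating “no results at all” as OK is an assumption you may want to decide differently. The processing code calls record(), and whatever computes the health check status calls consolidate() once per reporting cycle.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy consolidation of many "ok"/"not ok" results into one status per cycle.
class ResultConsolidatorSketch {

    enum Status { OK, WARN, CRITICAL }

    // Thread-safe queue: many worker threads record, one reporter consolidates.
    private final ConcurrentLinkedQueue<Boolean> results = new ConcurrentLinkedQueue<>();
    private final double warnRatio;     // failure ratio above which WARN is reported
    private final double criticalRatio; // failure ratio from which CRITICAL is reported

    ResultConsolidatorSketch(double warnRatio, double criticalRatio) {
        this.warnRatio = warnRatio;
        this.criticalRatio = criticalRatio;
    }

    // Called by the processing code for every single result.
    void record(boolean ok) {
        results.add(ok);
    }

    // Called when the status is requested: drains everything collected since
    // the last call, counts, and maps the failure ratio to a status.
    Status consolidate() {
        int ok = 0;
        int failed = 0;
        Boolean result;
        while ((result = results.poll()) != null) {
            if (result) {
                ok++;
            } else {
                failed++;
            }
        }
        int total = ok + failed;
        if (total == 0) {
            return Status.OK; // assumption: nothing processed means nothing failed
        }
        double failureRatio = (double) failed / total;
        if (failureRatio >= criticalRatio) {
            return Status.CRITICAL;
        }
        return failureRatio > warnRatio ? Status.WARN : Status.OK;
    }
}
```

Because consolidate() empties the queue, every cycle is judged only by the results that arrived during that cycle.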

When you design such consolidation algorithms, you should always keep in mind how errors are reported. In the case mentioned above, 10 seconds full of errors would be reported as CRITICAL only for a single reporting cycle. The cycle before and the cycle after could be OK again. Or if you have larger cycles (e.g. 5 minutes for your Nagios), think about how 10 seconds of errors are reported when you have no problem at all in the remaining 4’50’’. Should this be reported with the same result as the same number of errors spread over the whole 5 minutes? How should this case be handled if you have decided to ignore an average rate of 2% failing transactions?

You see that you can spend a lot of thought on it. But rest assured: Do not try to be too sophisticated. Just take a simple approach and implement it. Then you’re better than 80% of all projects, because you have actually reasoned about this problem and decided to write a health check!

About JCR queries

In the past days 2 interesting blog posts have been written about the use of JCR query, Dan Klco’s “9 JCR-SQL2 queries every AEM developer should know” and “CQ Queries demystified” by @ItGumby.

Well, if you have already read my older articles about JCR queries (part 1 and part 2), you might get the impression that I am not a big fan of JCR queries. There might be situations where that’s totally true.

When you come from the SQL world, queries are the only way to retrieve data; therefore many developers tend to use queries without ever thinking about the other way offered by JCR: the “walk the tree” approach.

@ItGumby gives 2 reasons why one should use JCR queries: efficiency and flexibility in structure. First, efficiency depends on many factors. In my second post I try to explain which kinds of queries are fast and which ones aren’t, simply because of the way the underlying index (even with AEM 6.0 it’s still Lucene in 99.9% of cases) works. With the custom indexes in AEM 6 we might have a game changer here.
Regarding flexibility: Yes, that’s a good reason. But there are cases where you have a specific structure and are looking for hits only in a small area of the tree; in those cases walking the tree can be the better choice. If you need to search the complete tree, a query can be faster.

Dan gives a number of good examples for JCR queries. And I wholeheartedly admit that the number of JCR SQL examples on the net is way too low. The JCR specification is quite readable for a large part, but I was never really good at writing code when all I have is the formal description of a language’s syntax. So a big round of applause for Dan!
But please allow me the recommendation to test every query first on production content (not necessarily on your production system!), just to find out the timing and the number of results. I have already seen cases where an implementation was fast on development but painfully slow on production just because of this tiny aspect.