The problems of multi-tenancy: governance

A recurring topic in AEM projects is multi-tenancy. Wikipedia describes multitenancy as a "[…] software architecture in which a single instance of a software […] serves multiple tenants". In the AEM projects I've done, I encountered this pattern most often when a company wants to host several brands and/or subsidiaries as independent tenants on a single AEM platform (that means: connected authoring and publishing instances). In this blog post I only cover the aspect of multi-tenancy within a single company. Hosting tenants for multiple independent companies is a different story and likely even more complex.

At first sight multi-tenancy seems to be only a technical problem (separation of content/templates/components, privileges, etc.), but from what I have learned, there is a much bigger problem which you should solve first: the aspect of organization and governance.

Multi-tenancy is hard when different tenants (be they brand organizations or subsidiaries) need to be integrated into the single platform. Each tenant has its own requirements (depending on its special needs), its own timelines, and its own budget. You have larger tenants and smaller tenants on your AEM platform. But this does not necessarily reflect the power of these tenants inside the company. It may even contradict it: a smaller or less powerful organization or brand may have demands that make it the largest tenant on your AEM platform.

That means there will be conflicts when it comes to defining scope, timeline and budget. The tenant which contributes more budget wants to have more influence on these three aspects than another tenant which spends a significantly smaller amount. But the smaller tenant might have needs which overrule this, for example a tradeshow for which some new features on the brand pages are absolutely required, while the other tenant (although more powerful within the organization) has requirements which only become important in a more distant future. How are these requirements prioritized?

These questions (and conflicts) are not new; they have existed for decades, if not centuries. But they have a huge impact on the platform owner. The platform owner wants to satisfy the needs of all the tenants, but is often faced with contradicting requirements. While on the technical side these can often be solved (more or less, just by throwing people and time at the problem), there are still things which are first and foremost organizational issues, and which can only be solved on an organizational or political level. Then you have topics like:

  • How can you coordinate the different timelines of different tenants so that you can satisfy all their needs?
  • Tenants want to have their own development teams or agencies. How can they work together and feed their results into a single platform without breaking it? Who's responsible when the platform breaks down?
  • How do you handle funding when one tenant contributes development work to the platform and other tenants benefit from this work as well? Do you invoice the tenants which benefit from another tenant's development work?
  • What's the role of the platform owner? Does the platform have its own budget, or is it solely funded by the tenants? Is the platform owner able to reject feature requests from tenants and say "no"?
  • How should the platform owner react to contradicting requirements? Is splitting the single platform into multiple ones (with different codebases) something which is desirable?

There are a lot of questions like these, and they are very specific to the company and the platform. They can all be solved, but the company and the organization itself has to solve them, not the platform development team(s). Otherwise the organizational mess will trickle down to the developers (and as we all know: this kind of human being doesn't really like that :-))

My ideal multi-tenancy project looks like this: a strong platform owner with some budget of its own. The tenants are pretty much the same size, and each funds the platform with roughly the same amount. A steering committee (with participants from all tenants) decides on all the organizational topics, and does the same on the technical level if required. Requirements are consolidated at project level and then implemented by a team which reports to the platform owner.

Yeah, I have to admit, I haven't found that customer project yet :-) But in such a project you, as a member of the development team, no longer really feel the multi-tenancy aspect on an organizational level; you only have to deal with it on a technical level. Which is very nice.

AEM Basics: Runmodes

Today I want to discuss a feature which is very basic and widely used: "runmodes". You might already have encountered them when you deployed an authoring instance and a publishing instance. Both can be deployed from the very same installation package, but just because of a magic string at the right place during installation the behaviour changes dramatically: one instance becomes an authoring instance, the other becomes a publishing instance. It's because of the runmode you configured.

You can think of runmodes as labels or roles you attach to instances, and "author" and "publish" are just special ones. At runtime you can check for these labels and react accordingly (the SlingSettingsService is your friend here). A more sophisticated use case is OSGi configuration: based on the location a config is placed in, it might be active or not, depending on the runmodes (see the AEM docs on this topic).
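
As a minimal sketch (the component and the custom runmode name are made up for illustration, not taken from any product code), a service can ask the SlingSettingsService for the set of active runmodes and branch on them:

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.settings.SlingSettingsService;

@Component
@Service(EnvironmentInfo.class)
public class EnvironmentInfo {

    @Reference
    private SlingSettingsService slingSettings;

    /** True on authoring instances. */
    public boolean isAuthor() {
        return slingSettings.getRunModes().contains("author");
    }

    /** True on instances started with a custom "integration" runmode. */
    public boolean isIntegration() {
        return slingSettings.getRunModes().contains("integration");
    }
}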

But runmodes are not limited to "author" and "publish"; you can attach as many runmodes to an instance as you like. For example you can create labels indicating the development environments (for example "integration" or "preproduction"), and you can have special configurations for these environments. This makes it a lot easier if you want your application to behave differently on these environments than on production.

The best of all: when you use runmodes to differentiate your environments from each other, you can easily keep the configurations for all environments in a single content package and deploy this package to all environments, no matter whether it's the production or the integration environment. If the runmodes don't match, a configuration simply doesn't become active.


AEM coding best practice: Servlets

You might say that servlets are old technology. So old that every Java web developer should know everything about them.

Yes, servlets have existed since the 1990s (to be exact: 1997), and the basics haven't really changed. So what's so special about servlets that I decided to write a dedicated blog post on them and title it "AEM coding best practice"?

Well, there’s nothing special in terms of coding. All things which are recommended since 1997, can still be considered valid. But there’s some subtle difference between servlet development for AEM development and the development of servlets for other types of applications: AEM (or: Sling) is resource oriented.

This aspect makes it hard for developers who normally bind servlets to hardcoded paths (either via annotations or via web.xml bindings). Binding servlets to a path is still possible in Sling, but it is actually an anti-pattern, because such a servlet is not bound to a real, existing resource, and therefore a number of the goodies of Sling are not applicable.

Instead, I recommend binding servlets to resource types. The first and probably most obvious benefit is that you do not need to hardcode any path in your code (or config); you can just move the resource type to the path where you want it to be, and the servlet can then be called via this path. The second benefit is that you can apply access control on the JCR nodes backing the respective resource. If you don't have read access on that resource, you cannot call the servlet. That is a great way to restrict certain functions to a limited set of users without implementing access control in your own code, just by using the out-of-the-box features of the JCR repository.

This "bind to a resource type" approach should remind you of the way resources and their components are wired: a resource has a "resource type" property, which denotes the component used to render it, and with a servlet you specify the resource type your servlet wants to handle. So it's basically the same mechanism, and instead of JSPs or Sightly scripts you can also use servlets to implement components. You can also easily handle selectors and different extensions in servlets, as the sketch below shows.
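
Here is a minimal sketch of such a servlet (the project name, resource type, selector and JSON payload are made up for illustration); it is registered against a resource type instead of a path and additionally restricts itself to one selector and extension:

import java.io.IOException;

import javax.servlet.ServletException;

import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

@SlingServlet(
        resourceTypes = "myproject/components/contact",
        selectors = "data",
        extensions = "json",
        methods = "GET")
public class ContactDataServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        // the servlet is only invoked if the user is allowed to read the backing resource
        Resource resource = request.getResource();
        String title = resource.getValueMap().get("jcr:title", "");

        response.setContentType("application/json");
        response.setCharacterEncoding("UTF-8");
        response.getWriter().write("{\"title\":\"" + title + "\"}");
    }
}

Assuming a resource whose sling:resourceType is myproject/components/contact, requesting it with the selector "data" and the extension "json" (…/contact.data.json) invokes this servlet, but only for users who can read that resource.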

I do not recommend dropping JSPs and Sightly altogether and switching to servlets, unless your frontend developers speak Java fluently, now and for the next few years. Sightly has been developed for exactly this purpose: frontend stuff should be handled by frontend developers and must not require Java development know-how. Use Sightly whenever possible.

And finally a bookmark for everyone working with Sling: the Sling servlets and scripts documentation.

AEM scaling patterns: Avoid shared sessions

The biggest change in AEM 6.0 compared to its prior versions is the use of Apache Oak as the repository implementation instead of Apache Jackrabbit 2.x. Although both implement the JCR 2.0 API (Oak not completely yet, but the "important" parts are there), there are a number of differences between them.

In the area of scalability the most notable change is the use of MVCC (multi-version concurrency control, a proven approach taken from the relational database world) in Oak. It decouples sessions from the global repository state and is the basis for the scalability of the repository. But it comes at the price that a session should only be used by a single thread. It is only a "should", because Oak detects any usage of a single session by multiple threads and then serializes the access to it.

(For the record: the same recommendation already applied to Apache Jackrabbit 2.x, but the impact was never that high, mostly because it wasn't as scalable as Oak is now.)

This isn’t a real limitation, but it requires careful design of any application. In the context of AEM normally it isn’t a problem at all, because all incoming HTTP requests use a dedicated session on their own. While this is true for the request, there is often functionality, which doesn’t follow this pattern.

I have put a common example of this pattern on GitHub, including a recommended implementation and a discouraged implementation. The problem in the discouraged example lies in the fact that the repository session (in the example hidden behind the resource resolver abstraction) is opened once at the startup of the service, by the thread which does the activation of all services. But then resources backed by this session are handed out to every other thread calling the getConfiguration() method. If every request does this call, they all get serialized here, thus limiting scalability.
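
The following is my own minimal sketch of that discouraged pattern (the ConfigurationService interface, the class name and the /etc/myapp/configuration path are hypothetical, and the GitHub example may differ in detail): a single resource resolver is opened during activation and its resources are handed out to all calling threads.

import org.apache.felix.scr.annotations.Activate;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Deactivate;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;

// ConfigurationService is a hypothetical interface declaring "Resource getConfiguration()".
@Component
@Service(ConfigurationService.class)
public class SharedSessionConfigurationService implements ConfigurationService {

    @Reference
    private ResourceResolverFactory resolverFactory;

    // opened once in the activator thread, then used by every request thread
    private ResourceResolver sharedResolver;

    @Activate
    protected void activate() throws LoginException {
        // requires a service user mapping for this bundle
        sharedResolver = resolverFactory.getServiceResourceResolver(null);
    }

    @Override
    public Resource getConfiguration() {
        // all callers receive resources backed by the same underlying session;
        // Oak serializes concurrent access to it, which limits scalability
        return sharedResolver.getResource("/etc/myapp/configuration");
    }

    @Deactivate
    protected void deactivate() {
        if (sharedResolver != null) {
            sharedResolver.close();
        }
    }
}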

In the recommended example this problem is mitigated in such a way that each call to getConfiguration() opens a new session, reads the required resource and then closes the session. Here the session and its data are held completely inside a single thread, and there's no need for synchronization anymore.
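
And a corresponding sketch of the recommended variant (again with hypothetical names; note that I changed the return type to a detached ValueMap and copy the properties out, so nothing of the already closed session leaks to the caller):

import java.util.HashMap;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.apache.sling.api.resource.ValueMap;
import org.apache.sling.api.wrappers.ValueMapDecorator;

@Component
@Service(PerCallSessionConfigurationService.class)
public class PerCallSessionConfigurationService {

    @Reference
    private ResourceResolverFactory resolverFactory;

    public ValueMap getConfiguration() {
        ResourceResolver resolver = null;
        try {
            // a fresh resolver (and session) per call, never shared between threads
            resolver = resolverFactory.getServiceResourceResolver(null);
            Resource config = resolver.getResource("/etc/myapp/configuration");
            if (config == null) {
                return ValueMap.EMPTY;
            }
            // copy the properties before the resolver is closed, so the returned
            // data no longer depends on the underlying session
            return new ValueMapDecorator(new HashMap<String, Object>(config.getValueMap()));
        } catch (LoginException e) {
            throw new IllegalStateException("Unable to obtain a resource resolver", e);
        } finally {
            if (resolver != null) {
                resolver.close();
            }
        }
    }
}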

That’s the theory part, but how can you detect easily if you have this problem as well? The easiest way is to set the logging for the class org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate to DEBUG. Every time Oak detects the problem, that a session is used by multiple threads, it prints a stack trace to the log. If this happens on write access, it uses the WARN level, in case of reads the DEBUG level.

23.02.2015 09:21:56.916 *WARN* [0:0:0:0:0:0:0:1 [1424679716845] GET /content/geometrixx/en/services.html HTTP/1.0] org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate Attempt to perform hasProperty while another thread is concurrently reading from session-494. Blocking until the other thread is finished using this session. Please review your code to avoid concurrent use of a session.
java.lang.Exception: Stack trace of concurrent access to session-494
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:276)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:113)
at org.apache.jackrabbit.oak.jcr.session.NodeImpl.hasProperty(NodeImpl.java:812)
at org.apache.sling.jcr.resource.JcrPropertyMap.read(JcrPropertyMap.java:350)
...

If you want to have a scalable AEM application, you should carefully watch out for these log messages and optimize the use of shared sessions.

Meta: I am on Summit 2015 in Salt Lake City

It's always hard as a techie to get to a customer conference (especially to a high-profile one); even as an Adobe employee it is hard to justify going to the Adobe Summit. But with the support of some co-workers I made it. I am very proud to present at the Adobe Summit this year in Salt Lake City.

I will have a (hands-on) lab session called "Silver bullets for Adobe Experience Manager success". It will cover some aspects of how you can use existing and well-known features of the AEM technology stack to get the most out of AEM. If you are an AEM expert I won't tell you anything new, but maybe I can give you some inspiration and ideas. But don't be too late with registration, the 3 slots I have are filling quickly.

I am really looking forward to it and I hope to meet many of my readers there. Just drop me a note if you want to meet with me in person.

Thanks,
Jörg

AEM anti-pattern: The hardcoded content structure

One of the first things I usually do when we start an AEM project is to get a clear vision of the content and its structure. We normally draw a number of graphs, discuss a number of use cases, and in the end we come up with a content structure which satisfies the requirements. Then we implement this structure as a hierarchy of nodes and that's it.

In many cases developers start to use this structure without much thinking. They assume that the node structure will always look like this. They even start to hardcode paths and language names, or mimic this structure elsewhere. Sometimes that's not a problem. But it gets hard when you are building a multi-language or multi-tenant site and you start simple with only one language and one tenant; then you might end up with these languages or tenants being hardcoded, because "there was no time to make it right". Imagine what happens when you add the second language or the second site and someone has hardcoded a language or a site name/path.

So, what can you do to avoid hardcoded paths? Some information is always stored in certain areas. For example you can store basic contact information on the site root node, so you can reuse it across the whole site. But how do you identify the correct root node if you have multiple sites? Or how do you identify the language of the site?

The easiest way is to mark these site root pages (I prefer pages here over plain nodes, as they can be created using the authoring UI and are much easier to author) with a certain property and value, ideally by giving them a dedicated template with its own resource type. Then you can identify these root pages using two approaches (sketched in the example after this list):

  • When you need to find them all, use a JCR query and look for all pages with this specific resource type.
  • When you need to find the site root page for a given page (or resource), just iterate up the hierarchy until you find a page with this resource type.
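
A minimal sketch of both lookups (the resource type and class name are made up; a real project would also add caching and error handling):

import java.util.Iterator;

import javax.jcr.query.Query;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

import com.day.cq.wcm.api.Page;

public final class SiteRootUtil {

    private static final String SITE_ROOT_RESOURCE_TYPE = "myproject/components/structure/siteroot";

    /** Approach 1: find all site root pages via a JCR query on the resource type. */
    public static Iterator<Resource> findAllSiteRoots(ResourceResolver resolver) {
        String query = "SELECT * FROM [cq:PageContent] AS content "
                + "WHERE ISDESCENDANTNODE(content, '/content') "
                + "AND content.[sling:resourceType] = '" + SITE_ROOT_RESOURCE_TYPE + "'";
        // the hits are the jcr:content nodes; the site root page is their parent
        return resolver.findResources(query, Query.JCR_SQL2);
    }

    /** Approach 2: walk up the page hierarchy until a page with the site root resource type is found. */
    public static Page findSiteRoot(Page page) {
        Page current = page;
        while (current != null) {
            Resource content = current.getContentResource();
            if (content != null && content.isResourceType(SITE_ROOT_RESOURCE_TYPE)) {
                return current;
            }
            current = current.getParent();
        }
        return null; // no site root above this page
    }
}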

This mechanism allows you to be very flexible in terms of the content hierarchy. You no longer depend on pages being on a certain level or having special names. It's all dynamic and you don't have any dependency on the content structure. This page doesn't even have to be the root page of the public-facing site; it can just be a configuration page used for administration and configuration purposes, and the real root page can be a child or grand-child of it. You have lots of choices then.

But wait, there is a single limitation: every site must have a site root page using this special template/resource type. But that isn't a hard restriction, is it?

And remember: Never do string operations on a content path to determine something, neither the language nor the site name. Never.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says this question has come up at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of what these scenarios could look like, I have drawn the 1:1 and the n:m scenarios.

[Figure: publish-dispatcher-connections-1-1-final]

The 1:1 setup, where each dispatcher is connected to exactly one publish instance; for cache invalidation, every publish instance is likewise connected only to its assigned dispatcher instance.

[Figure: publish-dispatcher-connections-N-M-final]

The n:m setup, where n dispatchers connect to m publish instances (for illustration here with n=3 and m=3); each dispatcher is connected via a load balancer to every publish instance, but each publish instance needs to invalidate all dispatcher caches.

I want to give you my personal opinion and answer to it. You might get other answers, both from Adobe consultants and from other specialists outside of Adobe. They all offer valuable insights into the question of how it's done best in your case. Because it's your case that matters.
My general answer to this question is: Use a 1:1 connection for these reasons:
  • it’s easy to debug
  • it’s easy to monitor
  • it does not require any additional hardware or configuration
From a high-availability point of view this approach seems to have a huge drawback: when either the dispatcher or the publish instance fails, the other part becomes unavailable as well.
Before we discuss this, let me state some facts which I consider the basis and foundation of all my arguments here:
  • The dispatcher and the web server (I can only speak for Apache HTTPD and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I've set up and operated a good number of web environments and I've never seen a crashing web server nor a crashing dispatcher module. As long as no one stops the process, this beast keeps handling requests.
  • A web server (and the dispatcher) is capable of delivering thousands of requests per second, if these files originate from the local disks and just need to be delivered. That's at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture it’s always the publish application layer. Which is exactly the reason why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) is able to deliver static files with a bandwidth of more than 500 Mbit per second (in a mixed-file scenario). So in most cases, before you reach the limit of your web servers, you reach the limit of your internet connection. Please note that this number is just a rough guess (and depends on many other factors).
Based on these assumptions, let’s consider these scenarios in a 1:1 setup:
  • When the publish instance fails, the dispatcher instance isn't fully operational anymore, as it can no longer reach its renderer instance; so it's best to take it out of the load-balancing pool.
    Does this have any effect on the performance capabilities of your architecture? Of course it does: it reduces your ability to deliver static files from the dispatcher cache. We could avoid this if the dispatcher were connected to other publish instances as well. But as stated above, the delivery performance of static files isn't a bottleneck at all, so taking out one web server has no visible effect.
  • A webserver/dispatcher fails, and the connected publish instance is not reachable anymore, effectively reducing the capacity of your bottleneck even more.
    Admittedly, that's true; but as stated above, I've rarely seen a crashed web server, so this case mostly comes down to hardware problems or massive misconfigurations.
So you have a measurable impact only in the case that web server hardware goes down; in all other cases it's not a problem for performance.
This is a small drawback, but from my point of view the other benefits stated above outweigh it by far.
This is my standard answer when there is no more specific information available. It's a good rule of thumb. But if you have more specific requirements, it might make sense to change the 1:1 rule to a different one.
For example:
  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 webserver/dispatchers as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so n copies of the same files get expensive in terms of disk space.
If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:
  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What's the effort to add or remove a publish instance? What needs to be changed?
Before you plan to spend a lot of time and effort on building a complex dispatcher scenario, please consider whether a CDN isn't a more appropriate solution to your problem…