
Sling Context-Aware Configuration

In the previous blog post I outlined why there’s a need to store some configuration outside of OSGI, and I named Sling Context-Aware Configuration (CA-Config) as a way to implement it.

CA-Config is a relatively new feature of AEM (introduced with AEM 6.2) which stores configuration inside the repository; beyond that it offers some features which are handy in a lot of use cases:

  • You don’t deal with nodes and resources anymore, but only with POJOs. This implementation completely hides where the configuration is stored; only a property on the content node references the location of the configuration (see the sketch after this list).
  • CA-Config implements an approach based on the repository hierarchy. If a configuration value is not found at the specified config location, the hierarchy is walked towards the root and further configurations are consulted, until a configuration is found or the root node is hit (in that case a default value for the configuration should be used).
  • The lookup process is highly configurable, although the default is sufficient in most cases.
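To make the POJO-based access concrete, here is a minimal sketch using the Sling CA-Config annotations; the configuration class, the property name and the default group are made-up examples, not part of any product:

```java
import org.apache.sling.caconfig.annotation.Configuration;
import org.apache.sling.caconfig.annotation.Property;

// Hypothetical configuration definition: a plain annotation class, no nodes or resources involved.
@Configuration(label = "Approver Configuration")
public @interface ApproverConfig {

    // Name of the group which approves content activations; the default is used
    // when no configuration is found anywhere up the hierarchy.
    @Property(label = "Approver group")
    String approverGroup() default "content-approvers";
}
```

Reading the configuration for a content resource could then look like this; the walk up the hierarchy and the fallback to the annotation default are handled by CA-Config itself:

```java
import org.apache.sling.api.resource.Resource;
import org.apache.sling.caconfig.ConfigurationBuilder;

public class ApproverLookup {

    // Resolves the approver group that applies to the given content resource.
    public String getApproverGroup(Resource contentResource) {
        ApproverConfig config = contentResource.adaptTo(ConfigurationBuilder.class)
                .as(ApproverConfig.class);
        return config.approverGroup();
    }
}
```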

A good example for this are content-related settings in a multi-tenant system. Imagine a multi-country and multi-language use case (where for some countries multiple languages have to be supported), and in each country a different user group acts as approver in the content activation process. In the past this was often stored as part of the pages themselves, but from a permission point of view the authors were then able to change it on their own (which typically only causes trouble). With CA-Config this can be stored outside of the /content tree in /conf, yet closely aligned to the content, and it can match the weirdest structural requirements; it is easily possible to have three different approver groups for different sections of the site in one country, while another country has just one group covering everything.

Another large use case are sites which are based on MSM livecopies. There is always content which is site-specific (e.g. contact addresses, names, logos, etc.). If this content is baked into a “regular” content page which is rolled out by MSM, the blueprint content will always overwrite the content which is specific to the livecopy. With CA-Config it is easily possible to maintain these site-specific configurations outside of the regular content tree.

Before CA-Config, very simple cases of this kind were often implemented using the InheritanceValueMap, which also implements such a “walk up the tree if the property is not found” algorithm. But it only looks at the direct parents and does not support the more structured configuration data that is possible with CA-Config.
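For comparison, this is roughly how the older InheritanceValueMap approach looks (the property name is again just an example):

```java
import org.apache.sling.api.resource.Resource;
import com.day.cq.commons.inherit.HierarchyNodeInheritanceValueMap;
import com.day.cq.commons.inherit.InheritanceValueMap;

public class InheritedPropertyLookup {

    // Walks up the hierarchy of the given resource until the property is found;
    // only flat properties on the ancestors are considered, no structured configuration.
    public String getApproverGroup(Resource contentResource) {
        InheritanceValueMap inheritedProps = new HierarchyNodeInheritanceValueMap(contentResource);
        return inheritedProps.getInherited("approverGroup", String.class);
    }
}
```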

There are also a lot of custom implementations which create “per site” configuration pages, typically next to the “homepage” page, with access restricted via ACLs or dispatcher/webserver/mapping rules. Of course these continue to work, but access to this configuration data could be managed via CA-Config as well (depending on how the data is stored, either by simply specifying these pages as config path or via a custom lookup logic, which can be configured for CA-Config). And on top of that, such configurations are strictly per site and do not support the above-mentioned case of having three different approver groups for different parts of the site.

So CA-Config is a very flexible and efficient way to retrieve configuration data which is stored in the repository. But it’s just for retrieving; there’s no prescribed way how the configuration data gets there.
While AEM comes with a configuration editor, it does not support writing the settings for CA-Config (at the moment it’s only designed for editable templates and workflows). WCM.io offers a content package for download which allows you to edit these configurations in the authoring UI.

In the next post I will outline how easy it is to actually use CA-Config.

Content and configuration

When developing for AEM, you are probably aware of the OSGI configuration. Every service stores its configuration within OSGI, and it’s very easy to extract the configuration from there for consumption. There are also a number of best practices around deploying and maintaining these configurations.
This is a very handy way to store information, and many developers tend to store all kinds of configurable data inside OSGI.

But should all configuration be stored inside OSGI? Let me show you some downsides of this approach:

  • OSGI configuration can only be changed by the “admin” user (or a member of the administrators group) and only through the OSGI webconsole. Typically these accounts are owned by IT, because any access to the OSGI webconsole gives you plenty of chances to mess up the system. A normal user should not have access to the OSGI webconsole, neither from a permission point of view nor from a networking point of view (lock down the /system/console URLs! Limit access to them to an admin network!)
  • Modifying configurations directly in the OSGI webconsole is not best practice; it’s OK to test something that way in a staging environment, but it should never be used to make changes in production. OSGI configuration should always be deployed in a content package, not changed via the webconsole.
  • In many cases changing the configuration of a service causes a restart of this service and of all other services depending on it. That means that making such a change on a running system can cause lots of errors while services are stopping and starting.

Based on that, it should be clear that not every piece of configuration belongs in OSGI configuration.

The question is then: What should be stored inside OSGI configuration?

  • Configuration which has a global impact. For example, a wrong search path of the SlingMainServlet can break the complete application.
  • Configuration which should be tested. Testing here means not only validating that a single operation works (like a backend connection test), but rather running a set of test cases. (A sketch of a typical OSGI-configured service follows after this list.)
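To make this a bit more concrete, here is a minimal sketch of a service whose configuration belongs in OSGI; the class name, the properties and the defaults are made-up examples. Note the activate method: changing this configuration restarts the component, which is exactly why content-like settings don’t belong here.

```java
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.metatype.annotations.AttributeDefinition;
import org.osgi.service.metatype.annotations.Designate;
import org.osgi.service.metatype.annotations.ObjectClassDefinition;

@Component(service = BackendConnector.class)
@Designate(ocd = BackendConnector.Config.class)
public class BackendConnector {

    @ObjectClassDefinition(name = "Backend Connector")
    public @interface Config {

        @AttributeDefinition(name = "Endpoint URL")
        String endpoint_url() default "https://backend.example.com";

        @AttributeDefinition(name = "Connection timeout (ms)")
        int connection_timeout() default 5000;
    }

    private Config config;

    // Called whenever the configuration changes; the component (and everything
    // depending on it) is restarted, so such changes belong into a deployment.
    @Activate
    protected void activate(Config config) {
        this.config = config;
    }
}
```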

And what should not be stored as OSGI configuration?

  • Configuration which changes independently of any release or deployment cycle. For example an access token for an external system, which needs to be updated every month.
  • Configuration which should be changed by regular authors and which should not require the admin user. The admin user is often owned by IT, and IT is typically very happy if they are not bothered with unrelated tasks from content authors.
  • Configuration which pertains to content. For example, whether a contact email address should be displayed as part of the page footer. Even if it happens to be a global decision (probably not!), it should not be configured via OSGI.
  • Configuration which is tied to a specific environment. For example you need the hostname of your site, but on UAT this name is of course different from production.

Starting with AEM 6.2, Sling Context-Aware Configuration (short: CA-Config) can be used to store and retrieve such configurations in a very efficient and Sling-ish manner. More details in the follow-up post. So stay tuned.

The problems of multi-tenancy: the development model

In large enterprises an AEM project tends to attract many different interested parties, which all love to make use of the features of AEM. They want to get onboard the platform as fast as they can. And this can be a real problem when it comes to such a multi-tenancy AEM platform.

In the previous post I wrote about the governance problems of such projects and all the politics involved. These problems persist in the daily business of developing and operating such platforms.

Many of these tenants already have their development partners and agencies which they are used to working with. These partners have experience in that specific area and know the business. So it’s quite likely that the tenants continue to work with their partners in this specific project as well. And that’s where the technical problems start.

Because at that point you’ll realize that you have multiple teams which rarely collaborate, or in the worst case not at all. Teams which might have different skill levels, operate in different development models and use different tooling. And each of these teams gets its own prioritization and has its own schedule, and in most cases the amount of communication between these teams is quite low.

So now the platform owner (or the development manager on their behalf) needs to set up a development model which allows these multiple teams to feed all their results into a single platform. A model which doesn’t slow down development agility and does not negatively impact the platform’s stability and performance. And this is quite hard.

A number of these challenges are (note: most of them are not specific to AEM at all!):

  • How can you ensure communication and collaboration between all development parties? That’s often a part which is left out (or forgotten) during time and budget estimation, and the amount of time spent on it therefore reflects this fact. But it’s the most important piece here.
  • On the other hand, how do you make sure that the overhead of communication and coordination is as low as possible? In most cases this means that each party gets its own version control system, its own Maven modules and its own build jobs. This allows a better separation of concerns during development and build time, but it just postpones the problem. Because …
  • How do you avoid the case that multiple parties use the same names which have to be unique? For example the same path below /apps or the same client library name? It’s hard to detect this at development time when you don’t have checks which cover multiple repositories and Maven modules.
  • Somewhat related: How do you handle dependencies on the same library but in different versions? Although OSGI supports this at runtime, AEM isn’t really prepared for a situation where you have a library in both version 1 and version 2. So you need to centrally manage the list of 3rd-party libraries (including version numbers) which the teams can use.
  • A huge challenge is testing. Once you have managed to deploy all delivered artifacts to a single instance (and combining these artifacts into deployable content packages often poses its own set of problems), how do you test and where do you report issues? How does the triaging process work which assigns the issues to the individual teams for fixing? This can very easily create a culture of blaming and denying, which makes the actual bug fixing very hard.
  • The same with production problems. No tenant and therefore no development team wants to get blamed for bringing down the platform because of some issue, so each problem can get very political, and teams start to argue why they are not responsible.
  • And many more…

These are real-world problems which hurt productivity.

My thoughts on how you can overcome (at least) some of these problems:

  • The platform owner should communicate openly with all tenants and involved development teams, and encourage them to adhere to a common development model.
  • The platform owner should provide clear rules on how each team is supposed to work and how they create and share their artifacts, and also clear rules for coding and naming.
  • The platform owner should be in charge of a small team which supports all tenants and all development teams and helps to align requirements and the integration of the different codebases. This team is also responsible for all 3rd-party library management and should have write access to the code repositories of all development teams.
  • Build and deployment is centralized as well.
  • Issue triaging is a cross-team effort.

All of this is only possible in a setup where the platform owner is not just a function responsible for running the platform, but is also allowed to exercise control over the deployment artifacts of the individual parties.

A side note: There is an architectural style called “microservices”, which seems to get a lot of traction at the moment. It claims to address the “many teams working on a single platform” problem as well. But the whole idea is based on splitting a monolithic application into self-contained services, which does not really apply to this multi-tenancy problem, where every tenant wants to customize some aspects of the common system for itself. If you apply this approach to the multi-tenancy problem here, you end up with a multi-platform architecture where each tenant has its own version of the platform.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says it has appeared at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of how these scenarios could look, I sketched the 1:1 and the n:m scenario.

[Figure: publish-dispatcher-connections-1-1]

The 1:1 setup, where each dispatcher is connected to exactly one publish instance; for invalidation, every publish instance is likewise connected only to its assigned dispatcher instance.

[Figure: publish-dispatcher-connections-N-M]

The n:m setup, where n dispatchers connect to m publish instances (illustrated here with n=3 and m=3); each dispatcher is connected via a load balancer to each publish instance, but each publish instance needs to invalidate all dispatcher caches.

I want to give you my personal opinion and answer to it. You might get other answers, both from Adobe consultants and from other specialists outside of Adobe. They all offer valuable insights into the question of how it’s done best in your case. Because it’s your case which matters.
My general answer to this question is: Use a 1:1 connection for these reasons:
  • it’s easy to debug
  • it’s easy to monitor
  • does not require any additional hardware or configuration
From a high-availability point of view this approach seems to have a huge drawback: when either the dispatcher or the publish instance fails, the other part is effectively unavailable as well.
Before we discuss this, let me state some facts which I consider basic and foundational to all my arguments here:
  • The dispatcher and the web server (I can only speak for Apache HTTPD and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I’ve set up and operated a good number of web environments and I’ve never seen a crashing web server nor a crashing dispatcher module. As long as no one stops the process, this beast keeps handling requests.
  • A webserver (and the dispatcher) is capable of delivering thousands of requests per second if the files originate from the local disk and just need to be delivered. That’s at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture it’s always the publish application layer. Which is exactly the reason why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) is able to deliver static files at a bandwidth of more than 500 Mbit per second (in a mixed-file scenario). So in most cases you reach the limit of your internet connection before you reach the limit of your web servers. Please note that this number is just a rough guess (it depends on many other factors).
Based on these assumptions, let’s consider these scenarios in a 1:1 setup:
  • When the publish instance fails, the dispatcher instance isn’t fully operational anymore, as it can no longer reach its renderer instance; so it’s best to take it out of the load balancing pool.
    Does this have any effect on the performance capabilities of your architecture? Of course it does: it reduces your ability to deliver static files from the dispatcher cache. We could avoid this if the dispatcher were connected to other publish instances as well. But as stated above, the delivery performance of static files isn’t a bottleneck at all, so taking out one web server has no visible effect.
  • A webserver/dispatcher fails, and the connected publish instance is not reachable anymore, effectively reducing the capacity of your bottleneck even further.
    Admittedly, that’s true; but as stated above, I’ve rarely seen a crashed web server, so this case mostly occurs with hardware problems or massive misconfigurations.
So you have a measurable impact only in the case that web server hardware goes down; in all other cases it’s not a problem for performance.
This is a small drawback, but from my point of view the other benefits stated above outweigh it by far.
This is my standard answer when there’s no more specific information available. It’s a good rule of thumb. But if you have more specific requirements, it might make sense to change the 1:1 rule to a different one.
For example:
  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 webservers/dispatchers as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so keeping n copies of the same files gets expensive in terms of disk space.
If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:
  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What’s the effort to add or remove a publish instance? What needs to be changed?
Before you spend a lot of time and effort on building a complex dispatcher scenario, please consider whether a CDN isn’t a more appropriate solution to your problem…