4 problems in projects with performance tests

Last week I was in contact with a colleague of mine who was interested in my experience with performance tests in AEM projects. In the last 12 years I have worked with a lot of customers and implemented smaller and larger sites, but all of these projects had at least one of the following problems.

(1) Lack of time

In all project plans time is allocated for performance tests, and resources are even assigned to them. But due to the many unexpected problems in a project there are delays, which are very hard to compensate for once the go-live or release date is set and has already been announced to (or by) senior management. You typically try to go live with the best you can, and steps such as performance tests are usually the first to be shortened. Sometimes you are still able to do performance tests before the go-live, and sometimes they are skipped and postponed until after it. But whenever timelines cannot be met, quality assurance and performance tests are the typical candidates to be cut first.

(2) Lack of KPIs

When you get the chance to do performance tests, you need KPIs. Just testing random functions of your website is simply not sufficient if you don’t even know whether these functions are used at all. You might test the least important functionality and miss the important ones. And without KPIs you don’t know whether your anticipated load is realistic. Are 500 concurrent users good enough, or should it rather be 50’000? Or just 50?

(3) Lack of knowledge and tools

Conducting performance tests requires good tooling: from the right environment (ideally sized comparably to production and running comparable code) to the right tool (no, you should not use curl or invent your own tool!) and an environment where you can run the load generators. Not to forget proper monitoring for the whole setup. You do want to know whether you provide enough CPU to your AEM instances, don’t you? So you should monitor it!

I have seen projects where all of that was provided, even licenses for LoadRunner (an enterprise-grade performance testing tool), but in the end the project was not able to use it because no one knew how to define test cases and run them in LoadRunner. We had to fall back to other tooling, or project management dropped performance testing altogether.

(4) Lack of feedback

You conducted performance tests, you defined KPIs and you were able to execute tests and get results out of them. You went live with it. Great!
But does the application behave as you predicted? Do you get the same good performance results in PROD as in your performance test environment? Having such feedback will help you refine your performance tests and challenge your KPIs and assumptions. Feedback helps you improve the performance tests and gain better confidence in the results.

Conclusion

If you haven’t encountered these issues in your project, you did a great job avoiding them. Consider yourself a performance test professional. Or part of a project dedicated to good testing. Or your project is so small that you could ignore performance tests entirely. Or you just deploy increments so small that you can validate the performance of each increment in production and redeploy a fixed version if you run into a problem.

Have you experienced different issues? Please share them in the comments.

Per folder workflows

Customizing workflows is daily business in AEM development. But in many cases it gets tricky when the workflow needs to behave differently based on the path. On top of that, some functionality in the authoring UI is hard-wired to certain workflows. For example, “Activate” will always trigger the “Request for activation” workflow (referenced by the path “/etc/workflow/models/request_for_activation“), and to be honest, you should not overlay this functionality and this UI control!

For cases like this there is the “Workflow Delegation Step” of ACS AEM Commons. It adds the capability to execute a new workflow model on the current payload and terminate the current one. How can this help you?

For example, you might want to use a “4 Eyes approval” workflow (/etc/workflow/models/4-eyes-approval) for /content/mysite, but just a simple activation (“/etc/workflow/models/dam-replication“) without any approval for DAM assets. In all cases, however, you want to keep the OOTB functionality and behaviour behind the “Activate” button. So you customize the “Request for activation” workflow in this way:

  • Add the “Workflow delegation step” as a first step into the “Request for activation” workflow and configure it like this:
    workflowModelProperty=activationWorkflow
    terminateWorkflowOnDelegation=true
  • Add a property “activationWorkflow” (String) with the value “/etc/workflow/models/4-eyes-approval” to /content/mysite.
  • Add a property “activationWorkflow” (String) with the value “/etc/workflow/models/dam-replication” to /content/dam.

And that’s it. When you now start the “Request for activation” workflow on the page /content/mysite/page, it will execute the “4 Eyes approval” workflow instead! To be exact: it will start the new workflow on the payload and then terminate the default “Request for activation” workflow.
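If you prefer to set these properties programmatically (for example from a content installation hook or a small maintenance task), a minimal sketch could look like the following. It is only an illustration: the helper class and the assumption that a ResourceResolver with write permission is already available are mine; the property name must simply match the workflowModelProperty configured in the delegation step.

import org.apache.sling.api.resource.ModifiableValueMap;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class ActivationWorkflowConfigurer {

  /**
   * Stores the delegation target on a content root, e.g.
   * setActivationWorkflow(resolver, "/content/mysite", "/etc/workflow/models/4-eyes-approval").
   */
  public static void setActivationWorkflow(ResourceResolver resolver, String contentRoot,
      String workflowModelPath) throws PersistenceException {
    Resource root = resolver.getResource(contentRoot);
    if (root == null) {
      throw new IllegalArgumentException("Path does not exist: " + contentRoot);
    }
    ModifiableValueMap properties = root.adaptTo(ModifiableValueMap.class);
    // the delegation step reads this property from the payload path (or its ancestors)
    properties.put("activationWorkflow", workflowModelPath);
    resolver.commit();
  }
}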

If you understand this principle, you can come up with many different use cases:

  • Running different “asset update workflows” on different folders
  • Every tenant can come up with their own set of workflow models, and they could even choose on their own which one they would like to use (if you provide them with proper permissions to configure the property and some nice UI).
  • (and much more)

And all of this works without customizing the workflow launchers; you need to customize the workflow models only once (adding the new first step).

Additional notes:

  • It would be great to have this workflow step as the first step in every workflow by default (that means as part of the product). If you want to customize the workflow you simply copy the default one, customize it and branch to it (by adding the properties to the respective /content node). If you don’t customize it, nothing changes. No need to overlay or adjust product settings! (Yes, I already requested this from AEM product engineering.)
  • We are working on a patch which allows specifying this setting in /conf instead of directly at the payload path.

Interview with Cedric Hüsler in the AEM podcast

Peter and Joey from the AEM podcast recently published their interview with Cedric Hüsler. Check out part 1, part 2 and part 3; there are a lot of interesting statements in there.

Cedric Hüsler is the director of product management for AEM at Adobe and a veteran of the WCM market. Although such a title sometimes suggests otherwise, Cedric is still very technical (at least sometimes he is, first-hand experience :-)), and this interview is a must-listen for everyone who is interested in the ways AEM has gone and will go.

Thanks Peter and Joey for this podcast!

Creating the content architecture with AEM

In the last post I tried to describe the difference between information architecture and content architecture; from an architectural point of view the content architecture is quite important, because your application design will emerge from it. But how can you get to a stable and well-thought-out content structure?

Well, there’s no bullet-proof approach to it. When you design the content architecture for an AEM-based application it helps to have some experience with the hierarchical approach offered by the repository. I will try to outline a process which might help to get you there.
It’s not a definitive guideline and I cannot guarantee that it will work for you, as it is just based on my experience with the projects I did. But I hope that it will give you some input and can act as a kind of checklist. My colleague Alex Klimetschek gave a presentation about this topic at the adaptTo() conference 2012.

The tree

But before we start, I want to remind you of the fact that everything you do has to fit into the JCR tree. This tree is typically a big help, because we often think in trees (think of decision trees, divide-and-conquer algorithms, etc.), and URLs are organized in a tree-ish way as well. Many people in IT are familiar with the hierarchical way filesystems are organized, so it’s both a comfortable and an easy-to-explain approach.

Of course there are cases where it makes things hard to model; if you hit that problem, you should try to choose a different approach. Building any n:m relation in the AEM content tree is counter-intuitive, hard to implement and typically not really performant.

Start with the navigation

Coming from the information architecture you typically have some idea of what the navigation of the site should look like. In a typical AEM-based site, the navigation is based on the content tree; that means that traversing the first 2-3 levels of your site tree creates the navigation (tree). If you map it the other way around, you can also derive the site tree from the navigation.

This decision definitely has an impact on your site, as the navigation is now tied to your content structure; changing one without the other is hard. So make your decision carefully.
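To illustrate how directly the navigation can be derived from the content tree, here is a minimal sketch based on the com.day.cq.wcm.api.Page API; the class name and the idea of passing in the site root path are just assumptions, and a real implementation would of course add things like caching and link handling.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.sling.api.resource.ResourceResolver;

import com.day.cq.wcm.api.Page;
import com.day.cq.wcm.api.PageManager;

public class SimpleNavigationBuilder {

  /**
   * Collects the first-level navigation entries directly from the content tree;
   * pages flagged as "hide in navigation" are skipped.
   */
  public static List<Page> buildTopNavigation(ResourceResolver resolver, String siteRootPath) {
    List<Page> navigation = new ArrayList<>();
    PageManager pageManager = resolver.adaptTo(PageManager.class);
    Page siteRoot = pageManager != null ? pageManager.getPage(siteRootPath) : null;
    if (siteRoot == null) {
      return navigation;
    }
    for (Iterator<Page> children = siteRoot.listChildren(); children.hasNext();) {
      Page child = children.next();
      if (!child.isHideInNav()) {
        navigation.add(child);
      }
    }
    return navigation;
  }
}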

Consider content-reuse

As the next step, consider the parts of the website which have to be identical, e.g. header and footer. You should organize your content in a way that these central parts are maintained once for the whole site, and that any change to them is inherited down the content tree. When you choose this approach, it’s also very easy to implement a feature which allows you to change that content at any level and inherit the changed content down the tree, effectively breaking the inheritance at that point.
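A common way to implement such an “inherit down the tree, but allow local overrides” behaviour is an inherited value map. The following sketch uses the HierarchyNodeInheritanceValueMap from com.day.cq.commons; the class name, property name and default value are only examples:

import org.apache.sling.api.resource.Resource;

import com.day.cq.commons.inherit.HierarchyNodeInheritanceValueMap;
import com.day.cq.commons.inherit.InheritanceValueMap;

public class InheritedContentLookup {

  /**
   * Resolves a property on the page content resource; if the current page does not
   * define it, the value of the closest ancestor that does define it is used.
   */
  public static String getInheritedProperty(Resource pageContentResource, String propertyName,
      String defaultValue) {
    InheritanceValueMap inherited = new HierarchyNodeInheritanceValueMap(pageContentResource);
    return inherited.getInherited(propertyName, defaultValue);
  }
}

Setting such a property on a page near the root makes it effective for the whole subtree; setting it again further down breaks the inheritance exactly at that point.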

At this point, also consider dispatcher invalidation. Whenever you change such “centralized” content, it should be easy to purge the dispatcher cache; in the best case the activation of the changed content triggers the invalidation of all affected pages (and not more!), assuming that you have your /statfileslevel parameter set correctly.

Consider access control

As the third step, let’s consider the already existing structure under the aspect of access control, which you will need on the authoring environment.
On smaller sites this topic isn’t that important, because you have only a single content team which maintains all the pages. But especially in larger organizations you have multiple teams, and each team is responsible for dedicated parts of the site.

When you design your content structure, overlay it with these authoring teams, and make sure that you avoid any situation where a principal has write access to a page but not to any of its child pages. While this is not always possible, try to follow these guidelines regarding access control (a small sketch follows after the list):

  • When going down from the root node of the tree to nodes on lower levels, always add more privileges, but do not remove them.
  • Every author for that site should have read access to the whole site.
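As a minimal sketch of such an additive setup (the group names, paths and privilege granularity are just assumptions), using the Jackrabbit AccessControlUtils helper: all authors get read access at the site root, and a team only gains additional write access on its own subtree:

import java.security.Principal;

import javax.jcr.RepositoryException;
import javax.jcr.Session;

import org.apache.jackrabbit.api.JackrabbitSession;
import org.apache.jackrabbit.commons.jackrabbit.authorization.AccessControlUtils;

public class SiteAclSetup {

  /**
   * Grants read access to all authors of the site and adds (instead of replacing)
   * write access for the product team on its dedicated subtree.
   */
  public static void applySiteAcls(Session session) throws RepositoryException {
    JackrabbitSession jrSession = (JackrabbitSession) session;
    Principal allAuthors = jrSession.getPrincipalManager().getPrincipal("mysite-authors");
    Principal productTeam = jrSession.getPrincipalManager().getPrincipal("mysite-product-team");

    // every author of the site can read the whole site
    AccessControlUtils.addAccessControlEntry(session, "/content/mysite",
        allAuthors, new String[] { "jcr:read" }, true);

    // the product team additionally gets write access on its own part of the tree
    AccessControlUtils.addAccessControlEntry(session, "/content/mysite/products",
        productTeam, new String[] { "jcr:read", "rep:write" }, true);

    session.save();
  }
}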

If you have a very complicated ACL setup (and you’ve already failed to make it simpler), consider changing your content structure at this point and giving the ACL setup a higher priority than, for example, the navigation.

My advice at this point: try to keep your ACL setup very simple; the more complex it gets, the more time you will spend debugging your group and permission setup to find out what’s going on in a certain situation, and the harder it will be to explain it to your authors.

Multi-Site with MSM

Having gone through these three steps, you already have some idea of what your final content structure needs to look like. There is another layer of complexity if you need to maintain multiple sites using the Multi Site Manager (MSM). The MSM allows you to inherit content and content structure to another site, which is typically located in a parallel subtree of the overall content tree. Choosing the MSM will keep your content structures consistent, which also means that you need to plan and set up your content master (in MSM terms: the blueprint) in a way that the resulting structure is well-suited for all copies of it (in MSM terms: the live copies).

And on top of the MSM you can add more specifics, features and requirements, which also influence the content structure of your site. But let’s finish here for the moment.

When you are done with all these exercises, you have a solid basis and have considered a lot of relevant aspects. Nevertheless you should still ask others for a second opinion. Scrutiny really pays off here, because you are likely to live with this structure for a while.

Information architecture & content architecture

Recently I had a discussion in the AEM forums about how to reuse content. During this discussion I was reminded again of the importance of the way you structure content in your repository.

Often the term “information architecture” is used for this, but from my point of view that’s not 100% correct. Information architecture handles the various aspects of how your website itself is structured (in terms of navigation and layout, but also content). Its most important aspect is the efficient navigation and consumption of the content on the website by end users (see the Wikipedia article on it). But it doesn’t care about aspects like content reuse (“where do I maintain the footer navigation”), relations between websites (“how can I reduce the work to maintain similar sites”), translations or access control for the editors of these systems.

Therefore I want to introduce the term “content architecture“, which deals with questions like these. The information architecture has a lot of influence on it, but it is solely focused on the resulting website; the content architecture focuses on the way such sites can be created and maintained efficiently.

In the AEM world the difference can be made visible very easily: You can see the information architecture on the website, while you can see the content architecture within CRXDE Lite. Omitting any details: The information architecture is the webpage, the content architecture the repository tree.
If you have some experience with AEM you know that the structure of the website typically matches some subtree below /content. But in the repository tree you don’t find a “header” node at the top of every page’s “jcr:content” subtree; same with the footer. This piece of the resulting rendered website is taken from elsewhere and not maintained as part of every page, although the information architecture mandates that every page has a header and a footer.

Besides that, the repository also holds a lot of other supporting “content” which is important for the information architecture but not directly mandated by it. You have certain configuration which controls the rendering of a page; for example, it might control which contact email address is displayed in the page footer. From an information architecture point of view it’s not important where it is stored; but from a content architecture point of view it is very important, because you might have the chance to control it at a single location which then takes effect for all pages. Or at multiple locations, which results in changing it for individual pages. Or in a per-subtree configuration, where all pages below a certain page are affected. Depending on the requirements this will result in different content architectures.

Your information architecture will influence your content architecture (in some areas it may even be a 1:1 relation), but the content architecture goes way beyond it and deals with other “*bilities” like “manageability”, “evolvability” (how future-proof is the content if there are changes to the information architecture?) or “customizability” (how flexible is my content architecture in terms of individualization per page/subsite?).

You can see that it’s important to be aware of the content architecture, because it will have a huge influence on your application. Your application typically has a lot of built-in assumptions about the way content is structured. For example: “The child nodes below the content root node form the first-level navigation of the site”. Or: “The homepage of the site uses a template called ‘homepage'” (which, by the way, is also not covered by any information architecture, but is an essential part of the content architecture).

In the JCR world there is the second rule of David’s Model: “Drive the content hierarchy, don’t let it happen”. That’s the rule I quote most often, and even though it’s 10 years old, it’s still very true, because it focuses on the aspect of managing the content tree (= content architecture) and on deciding about it carefully, considering the consequences.

And rest assured: it’s easier to change your application than to change the content tree! (At least if it’s designed properly. If it isn’t … it’s even hard to change both.)

AEM coding pattern: Run-mode specific code

It is very common to have code which is not supposed to run on all environments, but only on some. For example, you might have code which is supposed to import data only on the authoring instance.

Then code often looks like this:

if (isAuthoring()) {
  // import data
}

boolean isAuthoring() {
  // slingSettingsService is an injected org.apache.sling.settings.SlingSettingsService
  return slingSettingsService.getRunModes().contains("author");
}

This code does what it’s supposed to do. But there can be a problem: you might want to run this code not only on authors, but (for whatever reason) also on publish instances. Or you don’t want to run it on UAT authors.
In such cases this code does not work anymore, because it’s not flexible enough (the runmode is hardcoded); any change requires a code change and a deployment.

A better way is to reformulate the requirement a bit: “The data import will only run if the ‘importActive’ flag is set to true”.

If you design this “importActive” flag as an OSGi configuration property and combine it with runmode-dependent configuration, you can achieve the same behaviour as above, but stay much more flexible. You can even disable it entirely (even if only for a certain time).
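The runmode-dependent part could then look like this (the PID, the project folder and the values are only an example of a typical setup): one configuration per runmode folder, and the OSGi Configuration Admin picks the one matching the active runmodes.

/apps/myproject/config.author/com.example.importer.DataImporter.config
    importActive=B"true"

/apps/myproject/config.publish/com.example.importer.DataImporter.config
    importActive=B"false"

On a UAT author you could add a config.author.uat folder which overrides the flag again, without touching any code.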

The code could then look like this:

@Property(boolValue = true)
private static final String IMPORT_ACTIVE_PROP = "importActive";

private boolean importActive;

@Activate
protected void activate(ComponentContext ctx) {
  importActive = PropertiesUtil.toBoolean(ctx.getProperties().get(IMPORT_ACTIVE_PROP), true);
}

if (importActive) {
  // import data
}

Now you have translated the requirement “imports should only happen on authoring” into a configuration decision, and it’s no longer hardcoded. And that’s the reason why I will be picky about this in code reviews.

Do not write to the repository in event handling

Repository writes are resource-intensive operations which always come with a cost. First of all, a write operation holds a number of locks, which limits the overall concurrency of write operations. Secondly, the write itself can take some time, especially if the I/O is loaded or you are running in a cluster with MongoDB as the backend; in that case it’s the latency of the network connection plus the latency of MongoDB itself. And third, every write operation causes async index updates, triggers JCR observation and Sling resource events, etc. That’s the reason why you shouldn’t take write operations too lightly.

But don’t be scared to write to the repository for performance reasons, no way! Instead, try to avoid unnecessary writes: either batch them and collapse multiple write operations into a single transaction if the business case allows it, or avoid the repository writes altogether, especially if the information does not need to be persisted.

Recently I came across a very nasty pattern of write operations: write amplification. A Sling resource event listener was listening for certain write events to the repository, and when one of these was received (which happened quite often), a Sling job was created to handle this event. And the job implementation just did another small change to the repository (write!) and finished.

In that case a single write operation resulted in:

  • A write operation to persist the Sling Job
  • A write operation performed by the Job implementation
  • A write operation to remove the Sling Job

So each of these “regular” write operations caused 3 subsequent writes to the repository, which is a great way to kill your write performance completely. Luckily, none of these 3 additional write operations caused the event listener to create a new Sling job again … That would have had the same effect as “out of office” notifications in the early days of Microsoft Exchange (which didn’t detect these and would send an “out-of-office” reply to the “out-of-office” sender): a very effective way of DoSing yourself!

(Image: a flood of writes and the remains of performance)

But even though that was not the case, it resulted in a very loaded environment and reduced the subjective performance a lot; thread dumps indicated massive lock contention on write operations. When these 3 additional writes were optimized (effectively removed, as it was possible to collect the information in memory and batch-write it after 30 seconds), the situation improved a lot, and the lock contention was gone.

The lesson you should take away from this scenario: avoid writing to the repository in JCR observation or Sling event listeners! Collect the data and write it outside of these handlers in a separate thread.
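A minimal sketch of this “collect now, write later” approach could look like the following. The class name, the /content/mysite prefix, the /var/myproject/change-log node and the “change-collector” service user mapping are all assumptions; the important part is that the event handler itself never writes, and that the background task persists everything in one single commit to a location outside the observed subtree (so the batch write does not re-trigger the collection).

import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.sling.api.SlingConstants;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.ModifiableValueMap;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;
import org.osgi.service.component.annotations.Reference;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

@Component(
    service = EventHandler.class,
    property = { EventConstants.EVENT_TOPIC + "=" + SlingConstants.TOPIC_RESOURCE_CHANGED })
public class BatchingChangeCollector implements EventHandler {

  // paths collected from events; persisted later in a single batch
  private final Queue<String> pendingPaths = new ConcurrentLinkedQueue<>();
  private ScheduledExecutorService scheduler;

  @Reference
  private ResourceResolverFactory resolverFactory;

  @Activate
  protected void activate() {
    scheduler = Executors.newSingleThreadScheduledExecutor();
    // flush the collected paths every 30 seconds in a separate thread
    scheduler.scheduleWithFixedDelay(this::flush, 30, 30, TimeUnit.SECONDS);
  }

  @Deactivate
  protected void deactivate() {
    scheduler.shutdownNow();
  }

  @Override
  public void handleEvent(Event event) {
    // no repository write here; just remember the paths we are interested in
    String path = (String) event.getProperty(SlingConstants.PROPERTY_PATH);
    if (path != null && path.startsWith("/content/mysite")) {
      pendingPaths.add(path);
    }
  }

  private void flush() {
    List<String> drained = new ArrayList<>();
    String path;
    while ((path = pendingPaths.poll()) != null) {
      drained.add(path);
    }
    if (drained.isEmpty()) {
      return;
    }
    Map<String, Object> authInfo =
        Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "change-collector");
    try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(authInfo)) {
      Resource changeLog = resolver.getResource("/var/myproject/change-log"); // assumed to exist
      if (changeLog == null) {
        return;
      }
      ModifiableValueMap props = changeLog.adaptTo(ModifiableValueMap.class);
      props.put("lastChangedPaths", drained.toArray(new String[0]));
      props.put("lastFlush", Calendar.getInstance());
      resolver.commit(); // one single save for the whole batch
    } catch (LoginException | PersistenceException e) {
      // in a real implementation: log the error and decide whether to re-queue the drained paths
    }
  }
}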

PS: An interesting side effect of Sling event listeners taking a long time is that these handlers are blacklisted once they take more than 5 seconds to process an event (e.g. because they were blocked on writing). They are then not fired again (until you restart AEM), unless you explicitly whitelist them or turn off this feature completely.