Category Archives: best practice

4 problems in projects on performance tests

Last week I was in contact with a colleague of mine who was interested in my experience with the performance tests we do in AEM projects. Over the last 12 years I have worked with a lot of customers and implemented smaller and larger sites, but all of these projects had at least one of the following problems.

(1) Lack of time

In all project plans time is allocated for performance tests, and resources are even assigned to them. But due to the many unexpected problems in a project there are delays, which are very hard to compensate for when the go-live or release date is set and has already been announced to or by senior management. You typically try to bring live the best you can, and activities such as performance tests are among the first to be cut short. Sometimes you are still able to do performance tests before the go-live, and sometimes they are skipped and postponed until after it. But when timelines cannot be met, quality assurance and performance tests are the typical candidates to be cut down first.

(2) Lack of KPIs

When you have the chance to do performance tests, you need KPIs. Just testing random functions of your website is simply not sufficient if you don’t even know whether these functions are used at all. You might test the least important functionality and miss the important ones. And without KPIs you don’t know if your anticipated load is realistic or not. Are 500 concurrent users good enough, or should it rather be 50’000? Or just 50?

(3) Lack of knowledge and tools

Conducting performance tests requires good tooling: starting from the right environment (hopefully with a sizing comparable to production, comparable code etc.) over the right tool (no, you should not use curl or invent your own tool!) to an environment where you execute the load generators. Not to forget proper monitoring for the whole setup. You want to know if you provide enough CPU to your AEM instances, don’t you? So you should monitor it!

I have seen projects where all of that was provided, even licenses for LoadRunner (an enterprise-grade performance testing tool), but in the end the project was not able to use it because no one knew how to define test cases and run them in LoadRunner. We had to fall back to other tooling, or project management dropped performance testing altogether.

(4) Lack of feedback

You conducted performance tests, you defined KPIs and you were able to execute tests and get results out of them. You went live with it. Great!
But does the application behave as you predicted? Do you have the same good performance results in PROD as in your performance test environment? Having such feedback will help you to refine your performance test, challenging your KPIs and assumptions. Feedback helps you to improve the performance tests and get better confidence in the results.

Conclusion

If you haven’t encountered these issues in your project, you did a great job avoiding them. Consider yourself a performance test professional. Or part of a project addicted to good testing. Or your site is so small that you could ignore performance tests altogether. Or you deploy increments so small that you can validate the performance of each increment in production and redeploy a fixed version if you run into a problem.

Have you experienced different issues? Please share them in the comments.

Per folder workflows

Customizing workflows is daily business in AEM development. But in many cases it gets tricky when a workflow needs to behave differently based on the path. On top of that, some functionality in the authoring UI is hard-wired to certain workflows. For example “Activate” will always trigger the “Request for activation” workflow (referenced by the path “/etc/workflow/models/request_for_activation“), and to be honest, you should not overlay this functionality and this UI control!

For cases like this there is the “Workflow Delegation Step” of ACS AEM Commons. It adds the capability to execute a new workflow model on the current payload and terminate the current one. How can this help you?

For example, assume you want to use a “4 Eyes approval” workflow (/etc/workflow/models/4-eyes-approval) for /content/mysite, but just a simple activation (“/etc/workflow/models/dam-replication“) without any approval for DAM assets. In all cases, however, you want to keep the out-of-the-box functionality and behaviour behind the “Activate” button. So you customize the “Request for activation” workflow in this way:

  • Add the “Workflow delegation step” as a first step into the “Request for activation” workflow and configure it like this:
    workflowModelProperty=activationWorkflow
    terminateWorkflowOnDelegation=true
  • Add a property “activationWorkflow” (String) with the value “/etc/workflow/models/4-eyes-approval” to /content/mysite.
  • Add a property “activationWorkflow” (String) with the value “/etc/workflow/models/dam-replication” to /content/dam.

And that’s it. When you now start the “Request for activation” workflow on the path /content/mysite/page, it will execute the “4 Eyes approval” workflow instead! To be exact: it will start the new workflow on the payload and then terminate the default “Request for activation” workflow.
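If you prefer to set these properties programmatically instead of via CRXDE Lite, a minimal JCR sketch could look like this (it assumes an already opened session with write permissions; the class and method names are just for illustration):

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class ActivationWorkflowConfig {

    // point each subtree to the workflow model the delegation step should start
    public static void configure(Session session) throws RepositoryException {
        Node mysite = session.getNode("/content/mysite");
        mysite.setProperty("activationWorkflow", "/etc/workflow/models/4-eyes-approval");

        Node dam = session.getNode("/content/dam");
        dam.setProperty("activationWorkflow", "/etc/workflow/models/dam-replication");

        session.save();
    }
}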

If you understand this principle, you can come up with millions of different use cases:

  • Running different “asset update workflows” on different folders
  • Every tenant can come up with their own set of workflow models, and they could even choose on their own which one they would like to use (if you give them the proper permissions to configure the property and some nice UI).
  • (and much more)

And it all works without customizing the workflow launchers; you need to customize the workflow models only once (adding the new first step).

Additional notes:

  • It would be great to have this workflow step as the first step in every workflow by default (that means: as part of the product). If you want to customize a workflow you simply copy the default one, customize it and branch to it (that means: by adding the properties to the /content node). If you don’t customize it, nothing changes. No need to overlay/adjust product settings! (Yes, I already requested this from AEM product engineering.)
  • We are working on a patch which allows you to specify this setting in /conf instead of directly at the payload path.

AEM transaction size or “do a save every 1000 nodes”

An old rule of thumb, dating back to earlier versions of CQ5, is: “when you do large repository operations, do a session.save() every 1000 nodes”. The typical justification is that this is the default of the Package Manager, and therefore it’s a kind of recommended approach. And to be honest, I don’t know the real reason for it, even though I have worked in the Day/Adobe ecosystem for quite some time.

Nevertheless, with Oak the situation has changed a bit. Limits are much more explicit, and the rule “do a save every 1000 nodes” can still be considered a true statement. But let me give you some background on why this limit exists at all, and then let’s find out whether this rule is still safe to use.

In the JCR specification there is the concept of transient space. This transient space holds all activities performed on a session until an implicit or explicit save() of the session. So the transient space holds all temporary data of a transaction, and the save() is comparable to the final commit of a transaction.

This transient space is typically held inside the Java heap, so dealing with it is fast.

But by definition such a transaction is not bounded in size. So technically you should be able to create a session which modifies every node and every property of a 2 TB repository. Such a transient space no longer fits into a heap of standard size (say 12 GB). In order to support this behavior nevertheless, Oak moves the transient space entirely into the storage (TarMK, MongoDB) if it gets too large (in DocumentNodeStore terminology this is called a “persistent branch”; see the documentation of the DocumentNodeStore for some details on branches). Then the size of the transaction is only limited by the amount of free storage on the persistence, but no longer by the size of the Java heap.

The limit is called update.limit, and by default it is 10k (up to and including Oak 1.4/AEM 6.2) or 100k (starting with Oak 1.6/AEM 6.3, see OAK-3036). But of course you can change this value using “-Doak.update.limit=40000”.

This value describes the number of changes the transient space of a single session can hold before it is moved into the persistence. A change is any change to a property or a node (adding/removing/modifying/reordering/…).

OK, that’s the theory, but what does this mean for you?

First, if the transient space is swapped out to the persistence, the final session.save() will take much longer compared to a transient space in memory, because to perform the save, the transient space needs to be read from the persistence first (which involves at least disk I/O, and in the case of MongoDB network I/O, which is even slower).

And second, when you add or change nodes, you typically deal with properties as well. So if you are on AEM 6.2 or older, you should make sure that you don’t do too many changes within a session, so you don’t hit this “10’000 changes” limit and pay the performance penalty. If you have a reasonable content structure, the above-mentioned rule of thumb of “do a save every 1000 nodes” goes in the right direction.
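To illustrate the pattern, a batched import could look like this minimal sketch (the node names and the batch size are only placeholders for the rule of thumb, not exact recommendations):

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class BatchedImport {

    private static final int BATCH_SIZE = 1000;

    public static void importNodes(Session session, Node parent, int total) throws RepositoryException {
        int pendingChanges = 0;
        for (int i = 0; i < total; i++) {
            Node child = parent.addNode("node-" + i, "nt:unstructured");
            child.setProperty("imported", true);
            pendingChanges += 2; // roughly: one new node plus one property change
            if (pendingChanges >= BATCH_SIZE) {
                session.save();  // flush the transient space before it grows too large
                pendingChanges = 0;
            }
        }
        if (session.hasPendingChanges()) {
            session.save();      // final commit for the remainder
        }
    }
}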

But that’s often not good enough, because the updates of the synchronous Oak indexes count towards these 10’000 changes as well. You might know that the synchronous indexes mirror the JCR tree, so adding 1000 JCR nodes will also add 1000 Oak nodes for the nodetype index. And that’s not the only synchronous index…

Thus increasing update.limit to a higher number makes a lot of sense, just to be on the safe side. But there is a drawback of such large limits: the size of the transient space. Imagine you upload 1000 assets (1 MB each) into your repository in a single session, and you have update.limit set to 100’000. The number of changes is unlikely to reach the update.limit, but your transient space will consume at least 1 GB of heap! Is your system designed and set up to handle this? Do you have enough free JVM heap?

Let’s conclude: The rule of thumb “do a save every 1000 nodes” might be a bit too optimistic on AEM 6.2 and older (with default values), but ok on AEM 6.3. But always keep the amount of transient space in mind. It can overflow your heap and debugging out-of-memory situations is not nice.

If you are interested in the inner workings of Oak, have a look at this great piece of documentation. It covers a lot of low-level concepts which are useful to know when you deal with the repository more often.

Sling healthchecks – what to monitor

The ability to know what’s going on inside an application is a major asset whenever you need to operate it. As IT operations you don’t need (or want) to know the details, but you need to know whether the application works “ok” by some definition. Typically IT operations deploys, alongside the application, a monitoring setup which allows them to make the statement whether the application is ok or not.

In the AEM world such statements are typically made via Sling healthchecks. As a developer you can write assertions against the correct and expected behaviour of your application and expose the result. While this technical aspect is understood quite easily, the more important question is: “What should I monitor”?

Over the last years and projects we typically used healthchecks for both deployment validation (if the application reported “ok” after a deployment, we continued with the next instance) and loadbalancer checks (if the application is “ok”, traffic is routed to this instance). This results in these requirements:

  • The status must not fluctuate, but rather be stable. This normally means that temporary problems (which might affect only a single request) must not influence the result of the healthcheck. But if the error rate exceeds a certain threshold, it should be reported (via healthchecks).
  • The healthcheck execution time shouldn’t be excessive. I would not recommend performing JCR queries or other repository operations in a healthcheck; rather consume data which is already there, to keep the execution time low.
  • A lot of the infrastructure of an AEM instance can be monitored implicitly. If you expose your healthcheck results via a page (/status.html) and this page returns a status “200 OK”, then you know that the JVM process is running, the repository is up, and that Sling script resolution and execution are working properly.
  • When you want your loadbalancer to determine the health status of your application based on the results of Sling healthchecks, you must be quite sure that every single healthcheck involved works in the right way, and that an “ERROR” in a healthcheck really means that the application itself is not working and cannot respond to requests properly. In other words: if this healthcheck goes to ERROR on all publish instances at the same time, your application is no longer reachable.

Ok, but what functionality should you monitor via healthchecks? Which part of your application should you monitor? Some recommendations.

  1. Monitor prerequisites. If your application constantly needs connectivity to a backend system and does not work properly without it, implement a healthcheck for it. If the configuration for accessing this system is missing or even the initial connection fails, let your healthcheck report “ERROR”, because then the system cannot be used.
  2. Monitor individual errors: for example, if such a backend connection throws an error, report it as WARN (remember, you depend on it!). And implement an error threshold; if this threshold is exceeded, report it as ERROR (see the sketch after this list).
  3. Implement a healthcheck for every stateful OSGi service which knows about “successful” or “failed” operations.
  4. Try to avoid the case that a single error is reported via multiple healthchecks. On the other hand, try to be as specific as possible when reporting. So instead of “more than 2% of all requests were answered with an HTTP status code 5xx” in healthcheck 1, you should report “connection refused from LDAP server” in healthcheck 2. In many cases fundamental problems will trigger a lot of different symptoms (and therefore cause many healthchecks to fail), and it is very hard to change this behaviour. In that case you need to document explicitly how to react to such results and how to find the relevant healthcheck quickly.
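To give a rough idea of points 1 and 2, such a threshold-based healthcheck could be sketched with the Sling healthcheck API (org.apache.sling.hc.api) like this; BackendConnectionStats is a hypothetical service of your application which already tracks backend calls, and the 2% threshold is just an example:

import org.apache.sling.hc.api.HealthCheck;
import org.apache.sling.hc.api.Result;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = HealthCheck.class, property = {
        HealthCheck.NAME + "=Backend Connection",
        HealthCheck.TAGS + "=backend",
        HealthCheck.TAGS + "=loadbalancer"
})
public class BackendConnectionHealthCheck implements HealthCheck {

    private static final double ERROR_THRESHOLD = 0.02; // 2% failed calls

    // hypothetical application service which tracks backend call results
    public interface BackendConnectionStats {
        boolean isConfigured();
        double getErrorRate();
    }

    @Reference
    private BackendConnectionStats stats;

    @Override
    public Result execute() {
        if (!stats.isConfigured()) {
            // missing prerequisite: the backend cannot be used at all
            return new Result(Result.Status.CRITICAL, "Backend connection is not configured");
        }
        double errorRate = stats.getErrorRate(); // e.g. failures / total calls over the last minutes
        if (errorRate > ERROR_THRESHOLD) {
            // individual errors happen; only the exceeded threshold is reported
            return new Result(Result.Status.WARN, "Backend error rate above threshold: " + errorRate);
        }
        return new Result(Result.Status.OK, "Backend connection healthy");
    }
}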

Regarding the reporting itself, you can either report every problem/exception/failure or work with thresholds.

  • Report every single problem if the operation runs rarely. If you have a scheduled task with a daily interval, the healthcheck should report immediately if it fails. Also report “ok” if it works well again. The time between runs should give enough time to react.
  • If your operation runs very often (e.g. as part of every request) implement a threshold and report only a warning/error if this threshold is currently exceeded. But try to avoid constantly switching between “ok” and “not ok”.

I hope that this article gives you some idea of how you should work with healthchecks. I see them as a great tool, and they are very easy to implement. They are ideal for one-time validations (e.g. as smoke tests after a deployment) and also for continuous observation of an instance (by feeding healthcheck data to loadbalancers). Of course they can only observe the things you defined up front and do not replace testing. But they really speed up processes.

AEM 6.3: set admin password on initial startup (Update)

With AEM 6.3 you have the chance to set the admin password already on the initial start. By default the quickstart asks you for the password if you start it directly. That’s a great feature and shortens quite a few deployment instructions, but it doesn’t always work.

For example, if you first unpack your AEM instance and then use the start script, you’ll never get asked for the new admin password. The same applies if you work in an application server setup. And if you do automated installations, you don’t want to get asked at all.

But I found that even in these scenarios you can set the admin password as part of the initial installation. There are 2 different ways:

  1. Set the system property “admin.password” to your desired password, and it will be used (for example, add “-Dadmin.password=mypassword” to the JVM parameters).
  2. Or set the system property “admin.password.file” to the path of a file; if this file is readable by the AEM instance and contains the line “admin.password=myAdminPassword“, this value will be used as the admin password.

Please note that this only works on the initial startup. On all subsequent startups these system properties are ignored, and you should probably remove them, or at least purge the file in case of (2).

Update: Ruben Reusser mentioned that the OSGi web console admin password (which is used in case the repository is not running) is not changed. So you still need to take care of that.

AEM coding best practice: No String operations on paths!

I recently needed to review project code in order to validate whether it would cause problems when upgrading from AEM 5.6 to AEM 6.x; so my focus wasn’t on code quality in the first place, but on some other usual suspects (JCR queries etc.). But after having seen a few dozen classes I found a pattern which then turned up all over the code: the excessive use of String operations, especially String operations on repository paths.

For example something like this:

String[] segments = resource.getPath().split("/");
String settingsPath = "/" + StringUtils.join(segments, "/", 0, 2) + "/settings/jcr:content";
Resource settings = resourceResolver.getResource(settingsPath);
ValueMap vm = settings.adaptTo(ValueMap.class);
String language = vm.get("language", String.class);

(to read settings, which are stored in a dedicated page per site).

Typically it’s a weird mixture of String methods, StringUtils calls, plus some custom helpers which do things with ResourceResolvers, Sessions and paths, spread all over the codebase. Ah, and it lacks a lot of error checking (what if the settings page doesn’t contain the “language” property? adaptTo() can return null).

Sadly this problem is not limited to this specific project; I have found it in many other projects as well.

Such a code structure is a clear sign of a lack of abstraction and guidance. There are no concepts available which eliminate the need to operate on strings, so the developer is left with the only abstraction he has: the repository and CRXDE Lite’s view of it. He logs into the repository, looks at the structure and then decides how to mangle known pieces of information to get hold of the things he needs to access. If there is no one who reviews the code and rejects such pieces, the pattern emerges and you can find it all over the codebase. Even if developers create utility classes for it (normally every developer creates their own), it’s a bad approach, because these concepts are not designed (“just read the language from the settings page!”), but the design “just happens“; there is no focus on it, therefore quality is not enforced and error handling is typically quite poor.

Another huge drawback of this approach: it’s very hard to change the content structure, because assumptions about the content structure are baked in at many levels and are often not directly visible. For example, the constants “0” and “2” in the above code snippet determine the path prefix, but you cannot easily search for such definitions (even if they are defined as constants).

If the proper abstraction were provided, the above-mentioned code could look like this:

String language = "en";
Settings settings = resource.adaptTo(Settings.class);
if (settings != null) {
  language = settings.getLanguage();
}

This approach is totally agnostic of paths and the fact that settings are stored inside a page. It hides this behind a well-defined abstraction. And if you ever need to change the content structure you know where you need to change your code: Only in the AdapterFactory.
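To make this a bit more tangible, here is a minimal sketch of how such an abstraction could be implemented as an AdapterFactory (all class and package names are illustrative, and a Sling Model would work just as well); note that the knowledge about the content structure lives in exactly one place:

package com.example;

import org.apache.sling.api.adapter.AdapterFactory;
import org.apache.sling.api.resource.Resource;
import org.osgi.service.component.annotations.Component;

@Component(service = AdapterFactory.class, property = {
        AdapterFactory.ADAPTABLE_CLASSES + "=org.apache.sling.api.resource.Resource",
        AdapterFactory.ADAPTER_CLASSES + "=com.example.Settings"
})
public class SettingsAdapterFactory implements AdapterFactory {

    @Override
    public <AdapterType> AdapterType getAdapter(Object adaptable, Class<AdapterType> type) {
        if (!(adaptable instanceof Resource) || !Settings.class.equals(type)) {
            return null;
        }
        Resource resource = (Resource) adaptable;
        // the assumption "settings are stored at <site root>/settings/jcr:content"
        // is encoded here, and only here
        String[] segments = resource.getPath().split("/");
        if (segments.length < 3) {
            return null;
        }
        Resource settingsResource = resource.getResourceResolver()
                .getResource("/" + segments[1] + "/" + segments[2] + "/settings/jcr:content");
        if (settingsResource == null) {
            return null;
        }
        String language = settingsResource.getValueMap().get("language", String.class);
        return language == null ? null : type.cast(new Settings(language));
    }
}

// simple value object returned by the adaption (kept in the same file for brevity)
class Settings {

    private final String language;

    Settings(String language) {
        this.language = language;
    }

    public String getLanguage() {
        return language;
    }
}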

So whenever you see code which uses String operations on paths: Rethink it. Twice. And then come up with some abstraction layer and refactor it. It will have a huge impact on the maintainability of your code base.

AEM anti pattern: Long running sessions

AEM 6.x comes with Apache Oak, which is built on the MVCC principle. MVCC (multi-version concurrency control) gives you a view of a certain state within the repository. This state does not change and can be considered immutable. If you want to perform a change on the repository, the change is performed against this state and then applied (merged) to the HEAD state (which is the most current state within the repository). This merge is normally not a problem if the state of the session doesn’t differ too much from the HEAD state; in case the merge fails, you get an OakMerge exception.

Note that this is a change compared to Jackrabbit 2.x and CRX 2.x, where the state of a session was always updated, and where these merge exceptions never happened. This also means that you might need to change your code to make it work well with Oak!

If you have long-running sessions, the probability of such OakMerge exceptions gets higher and higher. This is due to other changes happening in the repository, which can also affect the areas where your session wants to perform its changes. This is a problem especially in cases where you run a service which opens a session in the activate() method, closes it in deactivate(), and uses it to save data to the repository in between. These are rare cases (because they have been discouraged for years), but they still exist.

The problem is that if a save() operation fails due to such an OakMerge exception, the temporary space of that session is polluted. The temporary space of a session is heap memory where all the changes are stored which are about to get saved. A successful session.save() cleans that space afterwards, but if an exception happens, this space is not cleaned. And if a session.save() fails because of such an OakMerge exception, any subsequent session.save() of this session will fail as well.

Such an exception could look like this (relevant parts only):

Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /content/geometrixx/en/services/jcr:content
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:237)
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:212)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.newRepositoryException(SessionDelegate.java:672)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.save(SessionDelegate.java:539)
at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.save(ItemDelegate.java:141)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(ItemImpl.java:262)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(ItemImpl.java:259)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:294)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:113)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.save(ItemImpl.java:259)
at org.apache.jackrabbit.oak.jcr.session.NodeImpl.save(NodeImpl.java:99)

There are 2 ways to mitigate this problem:

  • Avoid long-running sessions and replace them with a number of short-lived sessions. This is the way to go and in most cases the easiest solution to implement. It also avoids the problems that come with shared sessions.
  • Add code to call session.refresh(true) before you do your changes. This refreshes the session state to the HEAD state, so exceptions are less likely. If you run into a RepositoryException, you should explicitly clean up your transient space using session.refresh(false); then you lose your pending changes, but the next session.save() will not fail because of them. This is the path you should choose when you cannot create new sessions; a minimal sketch follows below.
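A minimal sketch of this second mitigation could look like this (path and property name are just placeholders; the session is the long-running session you unfortunately cannot get rid of):

import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class LongRunningSessionMitigation {

    public void writeWithRefresh(Session longRunningSession) throws RepositoryException {
        // bring the session state up to the current HEAD, keeping pending changes
        longRunningSession.refresh(true);
        try {
            longRunningSession.getNode("/content/myapp/jcr:content")
                    .setProperty("lastRun", System.currentTimeMillis());
            longRunningSession.save();
        } catch (RepositoryException e) {
            // drop the (potentially polluted) transient space so that later save() calls can succeed again
            longRunningSession.refresh(false);
            throw e;
        }
    }
}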