Category Archives: best practice

Sling Context-Aware configuration

In the previous blog post I outlined why there is a need to store some configuration outside of OSGi, and I named Sling Context-Aware Configuration (CA-Config) as a way to implement it.

CA-Config is a relatively new feature of AEM (introduced with AEM 6.2) which stores configuration inside the repository; beyond that it offers some features which are handy in a lot of use cases:

  • You don't deal with nodes and resources anymore, but only with POJOs. This implementation completely hides where the configuration is stored; the location of the config is referenced by a property on the content node (see the sketch after this list).
  • CA-Config implements an approach based on the repository hierarchy. If a configuration value is not found at the specified config location, the hierarchy is walked towards the root and further configurations are consulted, until a configuration is found or the root node is hit (in that case a default value for the configuration should be used).
  • The lookup process is very configurable, although the default one is sufficient in most cases.
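To give an impression of what this looks like in code, here is a minimal sketch. The configuration class, its name and its properties are made up for this example; the annotations and the ConfigurationBuilder adaption are part of the Sling CA-Config API.

import org.apache.sling.caconfig.annotation.Configuration;
import org.apache.sling.caconfig.annotation.Property;

// Hypothetical configuration POJO for this example; CA-Config stores its values below /conf.
@Configuration(label = "Site Settings")
public @interface SiteSettings {

    @Property(label = "Approver group")
    String approverGroup() default "default-approvers";

    @Property(label = "Default language")
    String language() default "en";
}

Reading it from any content resource is then a two-liner, and the actual storage location in /conf stays hidden:

// org.apache.sling.caconfig.ConfigurationBuilder; null-check of adaptTo() omitted for brevity
SiteSettings settings = resource.adaptTo(ConfigurationBuilder.class).as(SiteSettings.class);
String approverGroup = settings.approverGroup();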

A good example for this are content-related settings in a multi-tenant system. Imagine a multi-country and multi-language use case (where for some countries multiple languages have to be supported), and in each country different user groups should act as approvers in the content activation process. In the past this was often stored as part of the pages themselves, but from a permission point of view authors could then change it on their own (which typically only causes trouble). With CA-Config this can be stored outside of the /content tree in /conf, yet closely aligned to the content, and it can match even the weirdest structural requirements; it is then easily possible to have three different approver groups for sections of the site in one country, while another country has just one group covering everything.

Another large use case is sites which are based on MSM live copies. There is always content which is site-specific (e.g. contact addresses, names, logos, etc.). If this content is baked into a "regular" content page which is rolled out by MSM, the blueprint content will always overwrite the content which is specific to the live copy. With CA-Config it is easily possible to maintain these site-specific configurations outside of the regular content tree.

Before CA-Config, simple variants of this were implemented using the InheritanceValueMap, which also implements such a "walk up the tree if the property is not found" algorithm. But it only looks at the direct parents and does not support more structured configuration data as it is possible with CA-Config.
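For comparison, such an InheritanceValueMap lookup typically looks like this (HierarchyNodeInheritanceValueMap is the implementation shipped with AEM; the property name is just an example):

// Walks up the direct parent hierarchy of the resource until the property is found.
InheritanceValueMap inheritedProps = new HierarchyNodeInheritanceValueMap(resource);
String approverGroup = inheritedProps.getInherited("approverGroup", String.class);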

There were also a lot of custom implementations which created "per site" configuration pages, typically next to the "homepage" page, and restricted access to them via ACLs or dispatcher/webserver/mapping rules. Of course these continue to work, but access to this configuration data could be managed via CA-Config as well (depending on how the data is stored, either by just specifying these pages as config path or by a custom lookup logic which can be plugged into CA-Config). And on top of that, such configurations were strictly per-site and did not support the above-mentioned case of having three different approver groups on different parts of the site.

So CA-Config is a very flexible and efficient way to retrieve configuration data which is stored in the repository. But it is just for retrieving; there is no prescribed way how the configuration data gets there.
While AEM comes with a configuration editor, it does not support writing the settings for CA-Config (at the moment it’s only designed for editable templates and workflows). WCM.io has a content package for download which allows you to edit these configurations in the authoring UI.

In the next post I will outline how easy it is to actually use CA-Config.

4 problems with performance tests in projects

Last week I was in contact with a colleague of mine who was interested in my experience with performance tests in AEM projects. In the last 12 years I have worked with a lot of customers and implemented smaller and larger sites, and all of them had at least one of the following problems.

(1) Lack of time

In all project plans time is allocated for performance tests, and resources are even assigned to it. But due to the many unexpected problems in a project there are delays, which are very hard to compensate once the go-live or release date is set and already announced to or by senior management. You typically try to bring live the best you can, and when timelines cannot be met, quality assurance and performance tests are the typical candidates which get cut down first. Sometimes you are still able to do performance tests before the go-live, and sometimes they are skipped or postponed until after go-live.

(2) Lack of KPIs

When you get the chance to do performance tests, you need KPIs. Just testing random functions of your website is simply not sufficient if you don't know whether these functions are used at all. You might test the least important functionality and miss the important ones. And without KPIs you don't know if your anticipated load is realistic or not. Are 500 concurrent users good enough, or should it rather be 50'000? Or just 50?

(3) Lack of knowledge and tools

Conducting performance tests requires good tooling: starting from the right environment (hopefully sized comparably to production, running comparable code etc.) to the right tool (no, you should not use curl or invent your own tool!) and an environment where you execute the load generators. Not to forget proper monitoring for the whole setup. You want to know whether you provide enough CPU to your AEM instances, don't you? So you should monitor it!

I have seen projects where all of that was provided, even licenses for LoadRunner (an enterprise-grade performance testing tool), but in the end the project was not able to use it because no one knew how to define test cases and run them in LoadRunner. We had to fall back to other tooling, or project management dropped performance testing altogether.

(4) Lack of feedback

You defined KPIs, conducted performance tests and got results out of them. You went live. Great!
But does the application behave as you predicted? Do you get the same good performance results in production as in your performance test environment? Such feedback will help you refine your performance tests and challenge your KPIs and assumptions; it improves the tests and gives you more confidence in the results.

Conclusion

If you haven't encountered these issues in your project, you did a great job avoiding them. Consider yourself a performance test professional. Or part of a project addicted to good testing. Or your site is so small that you could ignore performance tests altogether. Or you deploy such small increments that you can validate the performance of each increment in production and redeploy a fixed version if you run into a problem.

Have you experienced different issues? Please share them in the comments.

Per folder workflows

Customizing workflows is daily business in AEM development. But in many cases it gets tricky when the workflow needs to behave differently based on the path. On top of that, some functionality in the authoring UI is hard-wired to certain workflows. For example, "Activate" will always trigger the "Request for activation" workflow (referenced by the path "/etc/workflow/models/request_for_activation"), and to be honest, you should not overlay this functionality and this UI control!

For cases like this there is the “Workflow Delegation Step” of ACS AEM Commons. It adds the capability to execute a new workflow model on the current payload and terminate the current one. How can this help you?

Assume you want to use a "4 Eyes approval" workflow (/etc/workflow/models/4-eyes-approval) for /content/mysite, but just a simple activation ("/etc/workflow/models/dam-replication") without any approval for DAM assets, while in all cases keeping the out-of-the-box functionality and behaviour behind the "Activate" button. Then you customize the "Request for activation" workflow like this:

  • Add the “Workflow delegation step” as a first step into the “Request for activation” workflow and configure it like this:
    workflowModelProperty=activationWorkflow
    terminateWorkflowOnDelegation=true
  • Add a property “activationWorkflow” (String) with the value “/etc/workflow/models/4-eyes-approval” to /content/mysite.
  • Add a property “activationWorkflow” (String) with the value “/etc/workflow/models/dam-replication” to /content/dam.

And that's it. When you now start "Request for activation" on a page below /content/mysite, it will execute the "4 Eyes approval" workflow instead! To be exact: it will start the new workflow on the payload and then terminate the default "Request for activation" workflow.
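If you want to set these properties in a setup routine rather than by hand, a minimal sketch with the Sling API could look like this (assuming a ResourceResolver with write permissions; null and error handling omitted):

// Point /content/mysite to the approval workflow; the delegation step picks up the property.
Resource siteRoot = resourceResolver.getResource("/content/mysite");
ModifiableValueMap props = siteRoot.adaptTo(ModifiableValueMap.class);
props.put("activationWorkflow", "/etc/workflow/models/4-eyes-approval");
resourceResolver.commit();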

If you understand this principle, you can come up with a million different use cases:

  • Running different “asset update workflows” on different folders
  • Every tenant can come up with their own set of workflow models, and they could even choose on their own which one to use (if you provide proper permissions to configure the property and some nice UI).
  • (and much more)

And all of this works without customizing the workflow launchers; you only need to customize the workflow models once (adding the new first step).

Additional notes:

  • It would be great to have this workflow step as the first step in every workflow by default (that means as part of the product). If you want to customize the workflow, you simply copy the default one, customize it and branch to it (by adding the properties to the /content node). If you don't customize it, nothing changes. No need to overlay/adjust product settings! (Yes, I already requested it from the AEM product engineering.)
  • We are working on a patch which allows specifying this setting in /conf instead of directly at the payload path.

AEM transaction size or “do a save every 1000 nodes”

An old rule of thumb, dating back to earlier versions of CQ5, is: "when you do large repository operations, do a session.save() every 1000 nodes". The typical justification is that this is the default of the Package Manager and is therefore a kind of recommended approach. And to be honest, I don't know the real reason for it, even though I have worked in the Day/Adobe ecosystem for quite some time.

Nevertheless, with Oak the situation has changed a bit. Limits are much more explicit, and this rule of "do a save every 1000 nodes" can still be considered a true statement. But let me give you some background on why this limit exists at all, and then let's find out whether this rule is still safe to use.

In the JCR specification there is the concept of transient space. This transient space holds all activities performed on a session until an implicit or explicit save() of the session. So the transient space holds all temporary data of a transaction, and the save() is comparable to the final commit of a transaction.

This transient space is typically held inside the Java heap, so dealing with it is fast.

But by definition this transaction is not bounded in size. Technically you should be able to create a session which modifies every node and every property of a repository of 2 TB size. Such a transient space no longer fits into a heap of standard size (say 12 GB). To support this nevertheless, Oak moves the transient space entirely into the storage (TarMK, MongoDB) when it gets too large (in DocumentNodeStore language this is called a "persistent branch"; see the documentation of the DocumentNodeStore for some details on branches). Then the size of the transaction is only limited by the amount of free space on the persistence, but no longer by the size of the Java heap.

The limit is called update.limit and defaults to 10k changes (up to and including Oak 1.4/AEM 6.2; 100k starting with Oak 1.6/AEM 6.3, see OAK-3036). But of course you can change this value using "-Doak.update.limit=40000".

This value describes the number of changes the transient space of a single session can hold before it is moved into the persistence. A change is any change to a property or a node (adding/removing/modifying/reordering/…).

OK, that’s the theory, but what does this mean for you?

First, if the transient space has been swapped out to the persistence, the final session.save() will take much longer compared to a transient space held in memory, because the transient space needs to be read back from the persistence first (which involves at least disk I/O, and in the case of MongoDB network I/O, which is even slower).

And second, when you add or change nodes, you typically deal with properties as well. So if you are on AEM 6.2 or older, you should make sure that you don't do too many changes within a session, so you don't hit this "10'000 changes" limit and pay the performance penalty. If you have a reasonable content structure, the above-mentioned rule of thumb of "do a save every 1000 nodes" goes in the right direction.
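In code this rule of thumb usually boils down to a simple batch-save loop; a minimal sketch (assuming an open JCR session and a list of paths to modify):

// Save in batches so the transient space stays well below update.limit.
final int batchSize = 1000;
int pending = 0;
for (String path : pathsToModify) {
    Node node = session.getNode(path);
    node.setProperty("migrated", true); // plus whatever changes you actually need
    if (++pending >= batchSize) {
        session.save(); // flush the transient space
        pending = 0;
    }
}
if (pending > 0) {
    session.save(); // persist the remainder
}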

But that is often not good enough, because the updates of the synchronous Oak indexes count towards the 10'000 changes as well. You might know that the synchronous indexes mirror the JCR tree: adding 1000 JCR nodes will also add 1000 Oak nodes for the nodetype index. And that's not the only synchronous index…

Thus increasing update.limit to a higher number makes sense just to be on the safe side. But such large limits come with a drawback: the size of the transient space. Imagine you upload 1000 assets (1 MB each) into your repository in a single session, with update.limit set to 100'000. The number of changes will most likely not reach the update.limit, but your transient space will consume at least 1 GB of heap! Is your system designed and set up to handle this? Do you have enough free JVM heap?

Let's conclude: the rule of thumb "do a save every 1000 nodes" might be a bit too optimistic on AEM 6.2 and older (with default values), but is OK on AEM 6.3. Always keep the size of the transient space in mind, though: it can overflow your heap, and debugging out-of-memory situations is not nice.

If you are interested in the inner workings of Oak, have a look at this great piece of documentation. It covers a lot of low-level concepts which are useful to know when you deal with the repository more often.

Sling healthchecks – what to monitor

The ability to know what's going on inside an application is a major asset whenever you need to operate it. As IT operations you don't need (or want) to know the details, but you need to know whether the application works "ok" by some definition. Typically IT operations deploys, alongside the application, a monitoring setup which allows it to state whether the application is ok or not.

In the AEM world such statements are typically made via Sling healthchecks. As a developer you can write assertions against the correct and expected behaviour of your application and expose the result. While this technical aspect is understood quite easily, the more important question is: “What should I monitor”?

In recent years and projects we typically used healthchecks both for deployment validation (if the application reports "ok" after a deployment, we continue with the next instance) and for loadbalancer checks (if the application is "ok", traffic is routed to this instance). This results in the following requirements:

  • The status must not fluctuate, but rather be stable. This normally means that temporary problems (which might affect only a single request) must not influence the result of the healthcheck. But if the error rate exceeds a certain threshold, it should be reported (via healthchecks).
  • The healthcheck execution time shouldn't be excessive. I would not recommend performing JCR queries or other repository operations in a healthcheck; rather consume data which is already there, to keep the execution time low.
  • A lot of the infrastructure of an AEM instance can be monitored implicitly. If you expose your healthcheck results via a page (/status.html) and this page responds with a status "200 OK", you know that the JVM process is running, the repository is up, and that Sling script resolution and execution are working properly.
  • When you want your loadbalancer to determine the health status of your application from the results of Sling healthchecks, you must be quite sure that every single healthcheck involved works the right way, and that an "ERROR" in a healthcheck really means that the application itself is not working and cannot respond to requests properly. In other words: if this healthcheck goes to ERROR on all publish instances at the same time, your application is effectively no longer reachable.

Ok, but what functionality should you monitor via healthchecks? Which part of your application should you monitor? Some recommendations.

  1. Monitor prerequisites. If your application constantly needs connectivity to a backend system and does not work properly without it, implement a healthcheck for it. If the configuration for accessing this system is missing or even the initial connection fails, let your healthcheck report "ERROR", because then the system cannot be used.
  2. Monitor individual errors: for example, if such a backend connection throws an error, report it as WARN (remember, you depend on it!). Additionally implement an error threshold, and if this threshold is exceeded, report it as ERROR (see the sketch after this list).
  3. Implement a healthcheck for every stateful OSGi service which knows about "successful" or "failed" operations.
  4. Try to avoid the case that a single error is reported via multiple healthchecks. On the other hand, try to be as specific as possible when reporting: instead of "more than 2% of all requests were answered with an HTTP status code 5xx" in healthcheck 1, report "connection refused from LDAP server" in healthcheck 2. In many cases fundamental problems will trigger a lot of different symptoms (and therefore cause many healthchecks to fail), and it is very hard to change this behaviour. In that case you need to document explicitly how to react to such results and how to find the relevant healthcheck quickly.
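To illustrate points 1 and 2, here is a minimal sketch of such a check based on the Sling Health Check API. The BackendConnectionStatus service and the threshold value are assumptions of this example, not a ready-made implementation.

import org.apache.sling.hc.api.HealthCheck;
import org.apache.sling.hc.api.Result;
import org.apache.sling.hc.util.FormattingResultLog;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = HealthCheck.class, property = {
    HealthCheck.NAME + "=Backend Connectivity",
    HealthCheck.TAGS + "=backend"
})
public class BackendConnectivityHealthCheck implements HealthCheck {

    private static final long ERROR_THRESHOLD_PERCENT = 2; // assumption for this sketch

    // Hypothetical helper service (assumption of this sketch): tracks backend calls
    // and keeps simple counters, so the healthcheck only consumes existing data.
    public interface BackendConnectionStatus {
        boolean isConfigured();
        long errorRatePercent();
    }

    @Reference
    private BackendConnectionStatus status;

    @Override
    public Result execute() {
        FormattingResultLog log = new FormattingResultLog();
        if (!status.isConfigured()) {
            // Missing prerequisite: the instance cannot work at all without the backend.
            log.critical("No backend configuration found");
        } else if (status.errorRatePercent() > ERROR_THRESHOLD_PERCENT) {
            // Individual errors only degrade the status once the threshold is exceeded.
            log.warn("Backend error rate {}% exceeds threshold of {}%",
                status.errorRatePercent(), ERROR_THRESHOLD_PERCENT);
        } else {
            log.info("Backend error rate {}% is below threshold", status.errorRatePercent());
        }
        return new Result(log);
    }
}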

Regarding the reporting itself, you can either report every single problem/exception/failure or work with a threshold:

  • Report every single problem if the operation runs rarely. If you have a scheduled task with a daily interval, the healthcheck should report immediately if it fails, and report "OK" again once it works. The time between runs gives you enough time to react.
  • If your operation runs very often (e.g. as part of every request), implement a threshold and report a warning/error only if this threshold is currently exceeded. But try to avoid constantly switching between "ok" and "not ok".

I hope this article gives you some idea of how to work with healthchecks. I see them as a great tool and very easy to implement. They are ideal for one-time validations (e.g. as smoke tests after a deployment) and also for continuous observation of an instance (by feeding healthcheck data to loadbalancers). Of course they can only observe the things you defined up front and do not replace testing, but they really speed up processes.

AEM 6.3: set admin password on initial startup (Update)

With AEM 6.3 you can set the admin password already on the initial start. By default the quickstart asks you for the password if you start it directly. That's a great feature and shortens quite a few deployment instructions, but it doesn't always work.

For example, if you first unpack your AEM instance and then use the start script, you will never get asked for the new admin password. The same applies if you work in an application server setup. And for automated installations you don't want to be asked at all.

But I found that even in these scenarios you can set the admin password as part of the initial installation. There are two different ways:

  1. Set the system property “admin.password” to your desired password; and it will be used (for example add “-Dadmin.password=mypassword” to the JVM parameters).
  2. Or set the system property "admin.password.file" to the path of a file; if this file is readable by the AEM instance and contains the line "admin.password=myAdminPassword", this value will be used as the admin password.

Please note that this only works on the initial startup. On all subsequent startups these system properties are ignored, and you should probably remove them, or at least purge the file in case of (2).

Update: Ruben Reusser mentioned that the OSGi web console admin password (which is used in case the repository is not running) is not changed this way. So you still need to take care of that separately.

AEM coding best practice: No String operations on paths!

I recently had to review project code in order to validate whether it would cause problems when upgrading from AEM 5.6 to AEM 6.x; so my focus wasn't on the code style in the first place, but on the usual suspects (JCR queries etc.). But after looking at a few dozen classes I found a pattern which showed up all over the code: the excessive use of String operations, especially string operations on repository paths.

For example something like this:

String[] segments = resource.getPath().split("/");
String settingsPath = StringUtils.join(segments, "/", 0, 3) + "/settings/jcr:content";
Resource settings = resourceResolver.getResource(settingsPath);
ValueMap vm = settings.adaptTo(ValueMap.class);
String language = vm.get("language", String.class);

(to read settings, which are stored in a dedicated page per site).

Typically it's a weird mixture of String methods, StringUtils calls plus some custom helpers which do things with ResourceResolvers, Sessions and paths, spread all over the codebase. Ah, and it lacks a lot of error checking (what if the settings page doesn't contain the "language" property? adaptTo() can return null).

Sadly this problem is not limited to this specific project; I found it in many other projects as well.

Such a code structure is a clear sign of a lack of abstraction and guidance. There are no concepts available which eliminate the need to operate on strings, so the developer is left with the only abstraction he has: the repository and CRXDE Lite's view of it. He logs into the repository, looks at the structure and then decides how to mangle known pieces of information to get hold of the things he needs to access. If there is no one who reviews the code and rejects such pieces, the pattern emerges and you can find it all over the codebase. Even if developers create utility classes for this (normally every developer creates his own), it's a bad approach, because these concepts are not designed ("just read the language from the settings page!") but the design "just happens"; there is no focus on it, therefore quality is not enforced and error handling is typically quite poor.

Another huge drawback of this approach: it is very hard to change the content structure, because assumptions about the content structure are baked in at many levels and are often not directly visible. In the example above, the constants "0" and "3" determine the path prefix, but you cannot search for such definitions (even if they are defined as constant values).

If the proper abstraction were provided, the above code could look like this:

String language = "en";
Settings settings = resource.adaptTo(Settings.class);
if (settings != null) {
  language = settings.getLanguage();
}

This approach is totally agnostic of paths and of the fact that the settings are stored inside a page; it hides all of this behind a well-defined abstraction. And if you ever need to change the content structure, you know where to change your code: only in the AdapterFactory.
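To make this more concrete, here is a rough sketch of such an AdapterFactory. The Settings interface, the assumed location of the settings page below the site root, and the site-root detection via getAbsoluteParent(1) are assumptions of this example, not a ready-made implementation.

package com.example;

import com.day.cq.wcm.api.Page;
import com.day.cq.wcm.api.PageManager;
import org.apache.sling.api.adapter.AdapterFactory;
import org.apache.sling.api.resource.Resource;
import org.osgi.service.component.annotations.Component;

// The abstraction the rest of the code base works with (assumption of this sketch).
interface Settings {
    String getLanguage();
}

@Component(service = AdapterFactory.class, property = {
    AdapterFactory.ADAPTABLE_CLASSES + "=org.apache.sling.api.resource.Resource",
    AdapterFactory.ADAPTER_CLASSES + "=com.example.Settings"
})
public class SettingsAdapterFactory implements AdapterFactory {

    @Override
    @SuppressWarnings("unchecked")
    public <T> T getAdapter(Object adaptable, Class<T> type) {
        if (!(adaptable instanceof Resource) || type != Settings.class) {
            return null;
        }
        Resource resource = (Resource) adaptable;
        // The only place which knows the content structure: the settings page is
        // assumed to live directly below the site root, e.g. /content/mysite/settings.
        PageManager pageManager = resource.getResourceResolver().adaptTo(PageManager.class);
        Page containingPage = pageManager == null ? null : pageManager.getContainingPage(resource);
        Page siteRoot = containingPage == null ? null : containingPage.getAbsoluteParent(1);
        Resource settingsContent = siteRoot == null ? null
            : siteRoot.adaptTo(Resource.class).getChild("settings/jcr:content");
        if (settingsContent == null) {
            return null;
        }
        String language = settingsContent.getValueMap().get("language", "en");
        return (T) (Settings) () -> language;
    }
}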

So whenever you see code which uses String operations on paths: Rethink it. Twice. And then come up with some abstraction layer and refactor it. It will have a huge impact on the maintainability of your code base.