How to use Runmodes correctly

Runmodes are an essential concept within AEM; they form the main and only way to assign roles to AEM instances; the primary usecase is to distinguish between the author and publish role, and another common usecase is also to split between PROD, Staging and Development environments. Technically it’s just a set of strings which are assigned to an instance, and which are used by the Sling framework at a few occassions, the most prominent being the Sling JCR Installer (which handles the /apps/myapp/config,/apps/myapp/config.author, etc. directories).

But I see other usecases; usecases where the runmodes are fetched and compared against hardcoded strings. A typical example for it:

boolean isAuthor() {
return slingSettingsService.getRunmodes().contains("author");
}

From a technical point of view this is fully correct, and works as expected. The problem arises when some code is based on the result of this method:

if (isAuthor()) {
// do something
}

Because now the execution of this code is hardcoded to the author environment; which can get problematic, if this code must not be executed on the DEV authoring instances (e.g. because it sends email notifications). It is not a problem to change this to:

if (isAuthor() && !isDevelopmentEnvironment()) {
// do something
}

But now it is hardcoded again 😦

The better way is to rely on the OSGI framework soley. Just make your OSGI components require some configuration and define the configuration for the runmodes required.

@Component(configurationPolicy=ConfigurationPolicy.REQUIRED)
public class myServiceImpl implements myService {
//...

This case requires NO CODING at all, instead you can just use the functionality provided by Sling. And this component does not even activate if the configuration is not present!

Long story short: Whenever you see a reference to SlingSettingsService.getRunmodes(), it’s very likely used wrongly. And we can generalize it to “If you add a reference to the SlingSettingsService, you are doing something wrong”.

There are only a very few cases where the information provided by this service is actually useful for non-framework purposes. But I bet you are not writing a framework 🙂

You think that I have written on that topic already? Yes, I did, actually 2 times already (here and here). But it seems that only repetition helps to get this message through. I still find this pattern in too many codebases.

A “no custom code challenge” for AEM?

My colleague Jan Exner initiated a “no custom code challenge” for the Analytics area earlier this year; and in the followup of this the people of 33sticks posted a good summary why it would be much better if you could avoid any custom code in the analytics world.

I am wondering if this holds true for AEM systems as well. On the one hand side customization is required. For example you need to style the components according to the requirements and styleguides. But on the other hand siede, excessive customization (overlays and adaptions/changes to ootb functionality) leads to maintenance and upgrade issues. But maybe we should not use the term “customization” anymore in the AEM world, but choose a more appropriate one, maybe “application development on AEM”, because that’s what we do in reality quite often.

And the application development part is the one which makes software expensive. It requires design, architecture, implementors, tests, automated tests, deployments. It requires management and comes with risk. The more application development we have, the higher the risk and the costs.

If you were able to avoid any application development in an AEM project, and just live with the core components components and brand them accordingly, that would be great. We would only focus on style and branding of the components, no need to Java developers and code deployments. Just pure frontend, and a clever use of the out-of-the-box tools AEM offers you.

I am truly convinced that you can build a standard marketing site (multi-site, multi-language, integrated translation etc) with this approach. It requires dicussion with the business and more important, you as a developer or architect need to urge yourself not write any code.

Of course, it’s probably getting a very basic site, but it can serve 2 purposes:

  • We identify what should really be part of AEM (which is something we can and should add asap)
  • We challenge ourselves to think in much simple structures and less customizations. I always wonder how easy the statement “then let’s overlay it” comes out of the mouth of an AEM consultant in a discussion, and I am no exception to this.

Yes, can we join Jan’s initiative. With AEM it’s definitely harder to achieve this than with other solutions of the Adobe Experience Cloud, but it’s doable. And honestly, we should accept such challenges more often. Even if we eventually fail.

But the learning is immense.

Understanding the “Oak Repository Statistics” MBean

In the last releases AEM has been greatly enhanced to provide information which are suitable for health detection. Especially Oak provides a huge amout of MBeans which can be monitored. But sometimes they are a bit hard to understand. Based on some ongoing activities I digged through the “Oak Repository statistics” MBean and found it quite useful, even for some basic understanding and analysis.

I did this analysis and the screenshot on AEM 6.5, but this MBean is present at least since AEM 6.1 (probably even 6.0) and its content hasn’t changed much.

When you access this MBean the top of the page looks like this (this instance has just started):

Oak Repository Statistics MBean

There are a number of values collected, and presented for a number of times:

  • per second: The raw value in each second for the last minute.
  • per minute: The aggregated value on a minute basis for the last hour
  • per hour: The aggregated value on a hourly basis for the last day
  • per day: the aggregated value on a daily basis

The aggregation differs based on the type of the metric:

  • Gauge: This is a simple value, which is not further processed. When values of types must be aggregated, an average is calculated.
  • Counter: This a number, which can be accumulated. When values of type counter must be aggregated, they are summed up.
Attribute NameTypeDescription
SessionCountGaugeThe number of JCR sessions, which are currently open
SessionLoginGaugeThe number of sessions opened within that time
SessionReadCountCounterThe number of read operations in the JCR (over all sessions)
SessionReadDurationGaugeThe total time spent in read operations (nanoseconds)
SessionReadAverageGaugethe average duration for read operations (SessionReadDuration divided by the number of reads)
SessionWriteCountCounterThe number of write operations in the JCR (over all sessions). Be aware that session.refresh() is also counted as write operation!
SessionWriteDurationGaugeTotal time spent writing to sessions in nanoseconds
SessionWriteAverageGaugethe average duration for write operations (SessionWriteDuration divided by the number of writes)
QueryCountCounterThe number of JCR queries executed
QueryDurationGaugeThe total time spent in JCR query operations (miliseconds)
QueryAverageGaugethe average duration for queries (QueryDuration divided by the number of writes)
ObservationEventCountCounterThe number of observation events delivered to all listeners
ObservationEventDurationGaugeThe total time spent processing events by all observation listeners in nanoseconds
ObservationEventAverageGaugethe average duration spent processing observation events (ObservationEventDuration divided by the number of events)
ObservationQueueMaxLengthGaugeThe maximum length of the JCR Observation Queue; in newer Oak versions this queue does no longer exist, and then the value is -1

This measurement is done to limit the amount of data which needs to be stored. And this data is stored within the JVM inside the Oak bundles; that means that any restart of the JVM or restart of the Oak bundles will reset these values. If you want to persist these values you need to read them via JMX and store them.

Ok, what can you do with all this data? Well, it can help you to answer many questions. For example you can find out very easy, if you have a session leak.Because then numbers ion the SessionCount attribute always increase over time. It’s also interesting to find out what is happening within your system when it’s completely idle. Are there repository writes which you are not expecting? Queries every few seconds?

If you are investigating performance issues, or if you want to avoid them, you should have a look into this MBean.

Update: Fixed the only link in this post. Thanks Oswald for reporting!

“We have an urgent performance issue” (part 2)

As a reaction of the last post I got the question by Oswaldo about specific recommendations on performance. Actually, there are a lot. But that’s material for another blog post 🙂 or skip to the bottom of this post.

Instead I want to give you a recommendation on how to handle situations when you did not have time nor capacity you can spend on thinking about performance and response times. But as an experienced technical leader you know that at some point this question will arise for sure. You might get a few hours to spend on that question, but how do you spend it most efficiently?

Clearly not for performance optimization! Because it’s not enough time to analyze and improve substantial parts of the application. And tomorrows changes might render these improvements useless…

Instead I would recommend you to spend this time in communication and building rapport with people who can help you in case such a performance problem arises. Get in contact with the operations people which are operating your system and application. Understand how they work and what tools they use. Understand how they can help you in case of performance issues, what information they can provide to you. Ask for an account on their monitoring system, just to demonstrate interest in their work and problems. And potentially give them some tips what they can additionally do to improve the quality of the information (for example asking if they can also provide the raw data and not only the visualization based on aggregated data). Or show them some hints how they improve their work with your application.

The biggest value in that activity is the fact, that in case the dreaded performance issue is noticed on an exec level, you already know who to talk to. You know a bit how the others are working and how you can help them. As a tech lead it’s then much easier to ask for logfiles, traffic patterns, CPU usage graphs and I/O latencies, threaddumps etc. You know upfront what information it operations already collects by default. You might have direct access to a monitoring system to get more information. You can even get a warning from the ops people in advance that some real big escalation is imminent. For me this is the best you can get if you have just a few hours to spend.

You might ask why is that important. Because it reduces the TTAD (time to actionable data) dramatically in case of such performance issues. You know who to get on the phone and into calls to start investigation. You already know what information is already available or you can even access it directly. You can report “We are analyzing data and can come up with first suggestions within the day” instead of “we are talking to IT and see how they can support us to get data”.

That’s much more important than spending some hours on random performance tuning. And in case you ever run into performance issues, these hours are one of the best investment you made in the whole project.

(And as random recommendation to improve AEM request rendering times: Disable the MobileRedirectFilter (PID: com.day.cq.wcm.mobile.core.impl.redirect.RedirectFilter) by setting the configuration parameter “redirect.enabled” to “false”. In the age of responsive websites it’s purpose is no longer given. And under load its performance impact can be significant.)

“We have an urgent performance problem!”

“Customer situation is heating up because they have urgent performance problems in their production environment” … That’s something I heard quite often in my consulting career. And in many cases it was an outcry for help, because this problem did not turn out during tests, but most times in production environments.

I read once a nice quote: “Everyone has a performance test environment. But only a few have a production environment!” So true.

Is that really inevitable that performance issues occur? Given the number of cases I’ve seen, I am inclined to believe it. But it is not.

I think, that if all of a sudden a performance problem is put on priority 1, it has a history. If you are in a project team and one morning your project lead/PO tells you that the priorities have shifted, and that you need to push that little unknown item “Performance tuning” from the bottom of the backlog to the top of the list, you were aware of performance as topic. But other features were considered more important.

Or if your customer is starting to escalate with your account team that a really bad application (the one you developed for them) performance is affecting their business, then you often know, that this is not a new issue.

In both cases the priority of this problem just hit a certain level, that executives are getting concerned and start to escalate this topic, because it is hurting their and/or the company’s goals. Here you go, yesterday everything was fine, today it’s all screwed.

All of these problems have a history; performance problems rarely just rise out of nowhere, but they evolve under the radar of ignorance. As long as noone complains, who cares about performance? Features are more important. Until the complaints get loud enough they cannot be overheard anymore.

But when you get at the point where you need to care about performance, you are often in a very bad position. Because now you need information, tools and processes to improve that situation very fast. Because are in the focus everyone is looking to your problem (it’s yours!). “Deliver a fast improvement! Within this week!

But if you have never prepared for that situation, you lack all the necessary things for such an operation:

  • You don’t have KPIs to know what is “acceptable” performance. You just have everyone’s feeling “it’s slow!”
  • You don’t have the tools to measure the current performance. You just have logfiles (hopefully you have them …)
  • Is your system actually able to deliver that performance?
  • “Has anyone got the reports from the latest performance tests?”

That’s a hard situation and there is barely any other way than just to take a bunch of good people and start a surgery in production: Analyzing lot of data, making guesses, increasing logging, deploying hotfixes, and do all the things you already accumulated list on your list of “things which might improve performance” which you maintained over the last months.

But let’s be honest: This is chaos, unplanned, affecting a lot of other teams and people, and hurting careers. But does it really have to be that way?

No. Application performance is no magic, but must be managed. Performance should always be an important aspect in your project, and resources should be spent on it as you spend resources on testing. Otherwise performance is ignored 90% of the time, and in the remaining 10% you are escalating because of the lack of it.

Roles & Rights and complexity

An important aspect in every project is the roles and rights setup. Typically this runs side-by-side with all other discussions regarding content architecture and content reuse.

While in many projects the roles and rights setup can be implemented quite straight forward, there are cases where it gets complicated. In my experience this most often happens with companies which have a very strong separation of concerns, where a lot of departments with varying levels of access are supposed to have access to AEM. Where users should be able to modify pages, but not subpages; where they are supposed to change text, but not the images on the pages. Translators should be able to change the text, but not the structure. And many things more.

I am quite sure that you can implement everything with the ACL structure of AEM (well, nearly everything), but in complex cases it often comes with a price.

Performance

ACL evaluation can be costly in terms of performance, if a lot of ACLs needs to be checked; and especially if you have globally active wildcard ACLs. As every repository acecss runs through it, it can affect performance.

There is not hard limit in number of allowed ACLs, but whenever you build a complex ACL setup, you should check and validate its impact to the performance.

  • Wildcard ACLs can be time consuming, thus make them as specific as possible.
  • The ACL inheritance is likely to affect deep trees with lots of ACL nodes on higher-level nodes.

Maintainability

But the bigger issue is always the maintenance of these permissions. If the setup is not well documented, debugging a case of misguided permissions can be a daunting task, even if the root cause is as simple as someone being the member of the wrong group. Just imagine how hard it is for someone not familiar with the details of AEM permissions if the she needs to debug the situation. Especially if the original creator of this setup is no longer available to answer questions.

Some experiences I made of the last years:

  • Hiding the complexity of the site and system is good, but limiting the (read) view of a user only to the 10 pages she is supposed to manage is not required, it makes the setup overly complex without providing real value.
  • Limiting write access: I agree that not everyone should be able to modify every page. But limiting write access only to small parts of a page is often too much, because then the number of ACLs are going to explode.
  • Trust your users! But implement an approval process, which every activation needs to go through. And use the versions to restore an older version if something went wrong. Instead of locking down each and every individual piece (and then you still need the approval process …)
  • Educate and train your users! That’s one of the best investments you can make if you give your users all the training and background to make the best of the platform you provide to them. Then you can also avoid to lock down the environment for untrained users which are supposed to use the system.

Thus my advice to everyone who wants (or needs) to implement a complex permission setup: Is this complexity really required? Because this complexity is rarely hidden, but in the end something the project team will always hand-over it to the business.

Writing unittests for AEM (part 3): Mocking resources

I introduced SlingMocks in the recent blog posts (part 1, part 2), and till now we only covered some basics. Let’s dig deeper now and use one of its coolest features: Mocking resources.

Well, to be honest, we don’t mock resources. We do something much better: We build an in-memory structure which represent sling resources. We can even map that to an in-memory JCR-repository!

For this blog post I created a very simple Sling model, which supposed to build a classic navigation. For simplicity I omitted nearly all of the logic which makes up a production-ready navigation component, and want to add just the pages below the site root to it.

The relevant files in my github repo:

The interesting part is the unit test. This time I used the AemMocks library from wcm.io, because it provides all the magic of the SlingMocks, plus some AEM specific objects I would like to use. As the AemContext class inherits directly from the SlingContext class, we can just use it as a drop-in replacement.

Loading the test data from a JSON flle in the test resources

The setup method is very simple: It loads a JSON structure and populates a in-memory resource tree with it (adds everything below a resource “/content”). Using this approach there’s no longer the need to mock resources, properties/value maps, and mock the relation between these. This is all done by the SlingMocks/AemMocks framework.

Thus a testcase looks like this:

No boilerplate code anymore

This simple example shows how easy it can be to write unit tests. The test code is very concise and easy to read and understand. In my opinion the biggest issue when writing unit tests now is creating the test content 🙂

Some notes to the test content:

  • My personal style is to put the test content into the same structure than the java packages. This makes it very easy to find the testcontent and keeps the testcontent and the unittest together.
  • You can write the JSON manually. But you can also use CRXDE Lite to create some structure in your local AEM instance and then export it with “http://localhost:4502/content/myproject/testcontent.tidy.-1.json ” and paste it into the JSON file. I find this way much more convenient, especially if you already have most of the test content ready. But if you do that, please clean up the unnecessary properties in the test data. They are polluting the test content and make it harder to understand.

Writing unit tests for AEM (part 2): Maven Setup

In the previous post I showed you how easy it is to start using SlingContext. If you start to use this approach in your own project and just copy the code, your Maven build is likely to fail with messages like this:

[ERROR] testActivate_unconfiguredParams(de.joerghoh.cqdump.samples.unittest.ReplicationServletTest)  Time elapsed: 0.01 s  <<< ERROR!
org.apache.sling.testing.mock.osgi.NoScrMetadataException: No OSGi SCR metadata found for class de.joerghoh.cqdump.samples.unittest.ReplicationServlet at org.apache.sling.testing.mock.osgi.OsgiServiceUtil.injectServices(OsgiServiceUtil.java:381)

The problem here is not the use of the SlingMock library itself, but rather the fact that we use SlingMocks to test code which uses OSGI annotations. The fix itself is quite straight-forward: We need to create OSGI metadata also for the unittests and not only for the bundling (SlingMock is reading these metadata for the test execution).

That’s the reason why there’s the need to have a dedicated execution definition for the maven-bundle-plugin:

https://github.com/joerghoh/unittest-demos/blob/master/core/pom.xml#L35

Also you need to instruct the maven-bundle-plugin to actually export the generated metadata to the filesystem using the “<exportScr> statements; this should be the default in my opinion, but you need to specify it explicitly. Also don’t forget to add the “_dsannotation” and “_metatypeannotations” statements to its instruction section:

https://github.com/joerghoh/unittest-demos/blob/master/core/pom.xml#L43

And even then it will fail, if you don’t upgrade the maven-bundle-plugin to a version later than 4.0:

https://github.com/joerghoh/unittest-demos/blob/master/pom.xml

Ok, if you adapted your POMs in this way, your SlingMock based unittests for OSGI r6-based services should run fine. Now you can start exploring more features of SlingMocks.

Writing unit tests for AEM — using SlingMocks

Over the course of the last years the tooling for AEM development improved a lot. While some years ago there was hardly an IDE integration available, today we have dedicated tools for Eclipse and IntelliJ (https://helpx.adobe.com/experience-manager/6-4/sites/developing/using/aem-eclipse.html). Also packaging and validation support was poor, today we have the opensourced filevault and tools like oakpal (thanks Marc!)

But with SlingMocks we also have much better unittest tooling (thanks a lot to Stefan Seifert and the Sling people), so we much more and better/easier tooling than just Mockito and Powermock. SlingContext (and its extension AemContext) allows you create unit tests quite easily. Using them can help you get rid of mocking Sling Resources, repo access and many things more.

On top of that, SlingMock can easily work with the new OSGI r6 annotations, which allow you to define OSGI properties in Pojos. Mocking the @ObjectClassDefinition classes isn’t that easy, because they are essentially annotations …

To illustrate that, I have created a minimal demo with a servlet and testcases for it (source code at Github). Basically the functionality of the class itself is not relevant for this article, but we want to focus on aspects how you utilize the frameworks best to avoid boilerplate code.

The code is quite simple, but to make unittesting a bit more challenging, it uses the new OSGI r6 annotations for OSGI configuration (using the @Designate and the @ObjectClassDefinition annotations) plus a referenced service.

https://github.com/joerghoh/unittest-demos/blob/master/core/src/main/java/de/joerghoh/cqdump/samples/unittest/ReplicationServlet.java#L29

If you try to mock the ReplicationServlet.Config class the naive way, you will find out, that it’s an annotation, which is referenced in the activate() method. I always failed to mock it somehow, so I switched gear and started to use SlingMock for it. (I don’t want to say that it is not possible, but it’s definitly not straight-foward, and in my opinion writing unit-tests should be straight forward, otherwise they are not created at all.)

With SlingMocks the approach changes. I am not required to create mocks, but SlingMocks provides a mocked OSGI runtime we can use. That means, that we create the OSGI parameters as a map to tell the SlingContext object to register our service with these parameters (line 59).

https://github.com/joerghoh/unittest-demos/blob/master/core/src/main/java/de/joerghoh/cqdump/samples/unittest/ReplicationServlet.java

Because SlingContext implements quite a bit of the OSGI semantics, it also requires that all referenced services are available (if these are static references). Therefor I use Mockito to mock the Replicator and I register the mock to provide the Replicator service. In realworld I could verify the interactions of my servlet with that mock.

This basic example illustrates how you can use SlingMocks to avoid a lot of mocking and stubbing. This example does not utilize the full power of SlingMocks yet, we are just scratching at the surface. But we already have some benefit : If you switch from SCR annotations to OSGI annotations, your SlingMock unittests don’t need any change, because it provides an OSGI-like environment, and there the way how metatypes are generated and injected are abstracted away.

From SCR annotations to OSGI annotations

Since the beginning of AEM development we used annotations to declare OSGI services; @Component, @Service, @Property and @Reference should be known to everyone how has ever developed backend stuff for AEM. The implementation behind these annotations came from the Apache Felix project, and they were called the SCR annotations (SCR = Service Component Runtime). But unlike the Service Component Runtime, which is part of the OSGI standard for quite some, these annotations were not standardized. This changed with OSGI Release 6.

With this release annotations were also standardized, but they are 100% compatible to the SCR annotations. And there are a lot of resources out there, which can help to explain the differences:

I recently worked on migrating a lot of the code from ACS AEM Commons from SCR annotations to OSGI annotations, and I want to share some learning I gained on the way. Because in some subtle areas the conversion isn’t that easy.

Mixed use of SCR annotations and OSGI annotations

You can mix SCR annotations and OSGI annotations in a project, you don’t need to migrate them all at once. But you can to be consistent on a class level, you cannot mix SCR and OSGI annotations in a single class. This is achieved by an extension to the maven-bundle-plugin (see below).

Migrating properties

SCR property annotations give you a lot of freedom. You can annotate them on top of the class (using the @Properties annotation as container with nested @Property annotations), you can annotate individual constant values to be properties. You can make them visible in the OSGI webconsole (technically you are creating a metatype for them), or you can mark them as private (not metatype is created).

With OSGI annotations this is different.

  • Metatype properties are handled in the dedicated configuration class marked with @ObjectClassDefinition. They cannot be private.
  • Properties which are considered to be private are attached to the @Component annotation. They cannot be changed anymore.

A limitation from a backward compatibility point of view: With SCR annotations you are not limited in the naming of properties, next to characters often the “.” (dot) and the “-” (dash, minus) was used. With OSGI r6 annotations you can easily create a property with a “.” in it

String before_after() default "something";

will result in the property with the name “before.after”; but with OSGI r6 annotations you cannot create properties with a “-” in it. Only OSGI r7 (which is supported in AEM 6.4 onwards) supports it with a construct like this:

String before$_$after() default "something";

If you want to keep compatibility with AEM 6.3, expect the breakage of property names or you need to investigate in workarounds (see #1631 of ACS AEM Commons). But my recommendation is to avoid the use of the “-” in property names alltogether and harmonize this in your project.

Update: I posted an additional blog post specifically on migrating SCR properties, mostly in the context of OSGI DS and OSGI Metatypes.

Labels & description

All the metatype stuff (that means, how OSGI configurations appear in the /system/console/configMgr view) is handled on the level of the @ObjectClassDefinition annotation and the method annotated with it. With the SCR annotations this was all mixed up between the @Component annotation and the @Property fields.

Update the tooling to make it work

If you want to work with OSGI annotations, you should update some elements in your POM as well:

  • Update the maven-bundle-plugin to 4.1.0
  • Remove the dependency to the maven-scr-plugin
  • Add a dependency to org.osgi:org.osgi.annotations:6.0.0 to your POM.
  • Then you need to add an additional execution to your maven-bundle-plugin (it’s called “generate-scr-metadata-for-unittests“) and update its configuration (see it on ACS AEM Commons POM).

The interesting part is here is the plugin to the maven-bundle-plugin, which can also handle SCR annotations; this statement allows you to mix both types of annotations.

This blog post should have given you some hints how you migrate the SCR annotations of an existing codebase to OSGI annotations. It’s definitly not a hard task, but some details can be tricky. Therefor it’s cool if you have the chance to mix both types of annotations, so you don’t need a big-bang migration for this.