If you search this blog, you find one recurring theme over the years: The lifecycle of JCR sessions and Sling ResourceResolvers. That you should not keep them open for a long time. And that you definitely have to close them. But I never gave you an example what can happen if you don’t follow this recommendation. Until now.
These days I learned that was is actual problem which can arise because of it. And the problem is called “SegmentNotFoundException”.
In the past a SegmentNotFoundException was a clear indication of a corrupt JCR repository. The recommendation was always either to fix it or to restore from backup. Both operations are tedious, require downtimes and possibly also mean a loss of data. That’s probably also the reason why this specific problem is often taken for the sign of such a repository exception. So let’s systematically look at it.
The root cause
With AEM 6.4 the feature of “tail-compaction” was introduced, which is a version of the online compaction feature. It is less efficient but takes less time than the full compaction. By default in AEM the tail compaction runs daily and the online compaction once a week.
But from what I understood, this tail compaction has a problem with long running sessions, and it can happen, that tar files are compacted and removed, which are still referenced. That means, that it’s not really a on-disk corruption which needs to be fixed, but rather that some “old sessions” (read about MVCC in the previous post) are referencing data which is not there (anymore).
Validate the symptoms
This problem I describe in this post happens under some special circumstances, which you should check first before you start the hunt for long-running sessions:
- You get SegmentNotFoundExceptions (always with the same segment ID).
- A repository check doesn’t find any inconsistency.
- If you restart the instance, the error is gone, but appears again after some time (mostly at least a day).
- You are running AEM 6.4 or AEM 6.5 (SP doesn’t seem to matter).
In the case I observed, only a single workflow step was affected, but not all the time and only after some time, which made me believe that it was related to the compaction. But it was very hard to track down the error, because the workflow step itself was complex, but safe.
Fix any long-running session in your application (unless you are registering an ObservationListener in there, which takes care of the refreshs by design). Really all. Use the JMX webconsole plugin and check the list of registered session mbeans every day on a production instance. Count them. Look at the timestamps when the session was opened.
In the case I observed, the long running session was in a different area of the application, but was working on the same data (user profiles) as the failing workflow step. But the 2 areas in the code were totally unrelated to each other, so that was the only way to track it down.
Some other notes, which I consider as important in this context:
- When you encounter a SegmentNotFoundException, please always open a support ticket, just in case. If it’s a different issue than described here, it’s better if you have that ticket open already.
- If you see exactly this issue, and changing your application code makes this problem go away, please also raise a support ticket. That bug should get fixed (even if long-running sessions are not recommended since years).
- As mentioned, when you encounter this issue, it’s not a persisted corruption. Restarting will cause the issue not to appear for some time, but that should only buy you time to identify and fix the long running sessions.
- And AEM as a Cloud Service is not affected by this problem, because neither Online Compaction nor Tail Compaction are used. Instead the Golden Master is offline compacted before cloning.