“Concurrent users” and performance tests

A few years ago when I was still working in application management of a large website we often had the case, that the system was reported to be slow. Whenever we looked into the system with our tooling we did not found anything useful, and we weren’t able to explain this slowness. We had logs which confirmed the slowness, but no apparent reason for it. Sadly the definition of performance metrics was just … well, forgotten. (I once saw the performance requirement section in the huge functional specification doc: 3 sentences vs 200 pages of other requirements.)

It was a pretty large system and rumors reported, that it was some of the biggest installations of this application. So we approached the vendor and asked how many parallel users are supported on the system. Their standard answer “30” (btw: that’s still the number on their website, although they have rewritten the software from scratch since then) wasn’t that satisfying, because they didn’t provide any way to actually measure this value on the production system.

The situation improved then a bit, got worse again, improved, … and so on. We had some escalations in the meantime and also ran over months in task force mode to fight this and other performance problems. Until I finally got mad, because we weren’t able to actually measure the how the system was used. Then I started to define the meaning of “concurrent users” for myself: “2 users are considered concurrent users, when for each one a request is logged in the same timeframe of 15 minutes”. I wrote a small perl script, which ran through the web server logs and calculated these numbers for me. As a result I had 4 numbers of concurrent users per hour. By far not exact, but reasonable to an extent, that we had some numbers.

And at that time I also learned, that managers are not interested in the definitions of a term, as long as they think they know what it means. So actually I counted a user as concurrent, when she once logged in and then had a auto-refresh of a page every 10 minutes configured in her web browser. But hey, no one questioned my definition and I think the scripts with that built-in definition are still used today.

But now we were able to actually compare the reported performance problems against these numbers. And we found out, that it was related sometimes (my reports showed that we had 60 concurrent users while the system was reported to be slow), but often not (no performance problems are reported but my reports show 80 concurrent users; and also performance problems with 30 reported users). So, this definition was actually useless… Or maybe the performance isn’t related to the “concurrent users” at all? I alread suspected that, but wasn’t able to dig deeper and improve the definition and the scripts.

(Oh, before you think “count the active sessions on the system, dude!”: That system didn’t have any server side state, therefor no sessions. And the case of the above mentioned auto-reload of a the startpage of a logged in user will result in the same result: She has an active session. So be careful.)

So, whenever you get a definition of “the system should support X concurrent users with a maximum response time of 2 seconds”, question “concurrent”. Define the activities of these users, build according performance tests and validate the performance. Have tools to actually measure the metrics of a system hammered with “X concurrent users” while these performance tests. Apply the same tooling to the production. If the metrics deliver the same values: cool, your tests were good. If not: Something’s wrong: either your tests or the reality…

So as a bottom line: Non-functional requirements such as performance should be defined with the same attention as functional requirements. In any other case you will run into problems. And

Advertisements