Why is it hard to do disk size estimations upfront?

Whenever a new project starts, a project manager is responsible for sizing the project, so that the right number of people with the right skills are assigned to it. In many projects another early task is to size the hardware. This is mostly due to the time it takes to buy and deploy new hardware, which can be quite long. On one project I worked on, it took IT 10 months (!!) from the decision to buy 8 of “these” boxes until anyone was able to log in to them. And by the way, this was the regular hardware purchasing process with no special cases …

Anyway, even if the whole “new hardware purchase and deployment” process takes only 6 weeks, you cannot just start the project and determine later what hardware is needed. When development starts, a development infrastructure must be in place, consisting of a reasonable number of systems with enough available resources. So one of the earliest project tasks is an initial system sizing.

If you have done it a few times for a specific type of project (for example CQ5 projects), you can give a basic sizing without doing major calculations; at that point you usually don’t have enough information to do a proper calculation anyway. So for a centralized development system (the one the continuous integration server deploys to) my usual recommendation is “4 CPUs, 8-12G RAM, 100G free disk; this for 1 author and 1 publish”. This is a reasonable system to actually run development on. (Remember that each developer also has an authoring and a publishing instance deployed on her laptop, where she actually tries out her code. On the central development systems all development tests are executed, as well as some integration tests.)

This gets much harder for higher environments like staging/pre-production/integration/test (or whatever you call it) and — of course — production, because there we have much more content. Content is the most variable factor in all calculations: in most requirement documents it is not clear how much content will be in the system in the end, how many assets will be uploaded, how often they are changed, and when they will expire and can be removed. To be honest, I would not trust any number given in such a document, because it usually changes over time, often even during the very first phase of the project. So you need to stay flexible regarding content, and therefore also regarding the disk space calculation.

My colleague Jayan Kandathil posted a calculation for the storage consumption of images in the datastore. It’s an interesting formula, which might well be accurate (I haven’t validated the values), but I usually do not rely on such formulas because:

  • We do not only upload images to the DAM, and besides the datastore there are also the TarPM and the Lucene index, which contribute to the overall repository growth.
  • I don’t know whether there will be adjustments to the “Asset update” workflow, especially whether more/fewer/changed renditions will be created. With CQ 5.5, any change to asset metadata also affects the datastore (XMP writeback changes the asset binary, which results in a new file in the datastore!).
  • I don’t know if I can rely on the numbers given in the requirements document.
  • There is a lot of other content in the repository, which I usually cannot estimate upfront. So the datastore consumption of the images is only a small fraction of the overall disk space consumption of the repository.

So instead of calculating the disk size based only on assumptions, I usually give a disk size to start with. This number is high enough that they won’t fill it up within the first 2-3 months, but not so large that they will never ever reach 100% of it. It’s somewhere in between. So they can go live with it, but they need to monitor it. Whenever they reach 85% of the disk size, IT has to add more disk space. If you run this process for some time, you can make a pretty good forecast of the repository growth and react accordingly by attaching more disk space. I cannot make this forecast upfront, because I don’t have any reliable numbers.
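The forecasting part of this process can be sketched with a simple linear trend over the monitoring samples. This is only an illustration with made-up example numbers (not any official tooling, and a hypothetical function name); a real setup would feed in the usage values collected by the monitoring system.

```python
# Sketch: fit a least-squares linear trend to repository disk usage
# samples and estimate when the 85% alert threshold will be reached.
# All numbers below are invented example data, not real measurements.

def forecast_threshold_day(samples, disk_size_gb, threshold=0.85):
    """samples: list of (day, used_gb) tuples from regular monitoring."""
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    # least-squares slope: GB of growth per day
    slope = sum((d - mean_x) * (u - mean_y) for d, u in samples) / \
            sum((d - mean_x) ** 2 for d, _ in samples)
    intercept = mean_y - slope * mean_x
    target = disk_size_gb * threshold
    # day on which the fitted line crosses the threshold
    return (target - intercept) / slope

# Example: 100 GB volume, usage sampled weekly
samples = [(0, 20), (7, 24), (14, 29), (21, 33)]
day = forecast_threshold_day(samples, 100)
print(f"85% threshold reached around day {day:.0f}")  # → around day 104
```

With a forecast like this, IT knows weeks in advance when to attach more disk space, instead of reacting to a full volume.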

So, my takeaway from this: I don’t spend much time on disk calculations upfront. I only give the customer a model, and based on this model they can react and attach storage in a timely manner. This is also the cheapest approach, because you attach storage only when it’s really needed, and not based on some unreliable calculation.