Archive for July, 2009

Traversing the content hierarchy

July 28, 2009

When you play around with websites, you often get a good feeling for how much “engineering” work and thought has been put into a site. Think of things like SQL injection and shell injection, which were a widespread problem a few years ago. If you encounter a site today that is vulnerable to such an attack, it’s either a problem of budget (which is a lame excuse, because modern frameworks avoid such problems) or of the developer’s skill. An up-to-date site should no longer be affected by such problems.

A problem that is clearly visible, but often unknown to application developers and architects, is the content structure that the URL exposes to the user. Consider the following small example of a simple content structure:

/content/brand/en/home

for the start page of a CQ-based website. The “home” handle is the start page and is thus requested via

www.example.com/content/brand/en/home.html

So, where’s the problem? Well, most templates provide some kind of HTML representation of their content. So let’s try

www.example.com/content/brand/en.html

Maybe a structure handle such as the language node (which is often used only to differentiate between languages) also provides an HTML representation of its content; it could simply render its child nodes as a bulleted list.

So what? Does it do any harm to reveal that you offer Chinese content besides English? Most of the time, it doesn’t. But what if you already have fresh content ready that isn’t linked yet? It would appear in such a list. Or what if you have “hidden content”, functions known only to a small group of people, which aren’t secured by authentication and authorization? Suddenly someone has found your private data and can make use of it.

The trash can for functionality is often a folder named “tools”; developers tend to put everything there that doesn’t fit well into any other category. So you can find contact forms, search functionality and other stuff there. So what happens if you call

www.example.com/content/brand/en/home/tools.html

Does it also show your unused, crappy or new functions, which aren’t used on the website but are still there? Perhaps because some developer thought it would be convenient to have all tools listed without major hassle (1 bookmark instead of 10)? Bad idea: you have just shown all your available tools to someone who shouldn’t see them.

So check your site to make sure that structure nodes, which are only used to structure your content, either cannot be rendered at all or don’t reveal any information that could be useful to an attacker. Either return an empty page or (suggested) return the HTTP status code 403 (Forbidden, i.e. access denied). Don’t reveal data when it isn’t necessary. A well-engineered site takes such “attacks” into account as well.

I’ve already run such tests on several CQ-based websites and found (besides some other things) a monitoring page (containing version information of the libraries in use) and also a hidden web special dedicated to a member of the web team who was heading for another location (hi, Katrin!). All of this information was publicly viewable (on a major corporate website!) just by playing around with path names and then following links.
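To make such a check repeatable, the parent URLs of a known page can be derived mechanically. The following is only a minimal sketch: the function name `ancestor_urls` is made up for this illustration, and it assumes that every node is addressed by its path plus an `.html` extension and that the URL carries a scheme such as `http://`.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def ancestor_urls(page_url, extension=".html"):
    """Return the URLs of all ancestor nodes of page_url, nearest first.

    Hypothetical helper; assumes nodes are rendered as <path><extension>.
    """
    parts = urlsplit(page_url)
    # Drop the extension of the last segment: /a/b/home.html -> /a/b/home
    path, _ = posixpath.splitext(parts.path)
    candidates = []
    parent = posixpath.dirname(path)
    # Walk upwards until the root of the path is reached.
    while parent not in ("", "/"):
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, parent + extension, "", ""))
        )
        parent = posixpath.dirname(parent)
    return candidates


if __name__ == "__main__":
    for url in ancestor_urls("http://www.example.com/content/brand/en/home.html"):
        print(url)
```

Each of the resulting URLs can then be requested by hand or with an HTTP client of your choice; on your own site, a structure node that answers with a rendered list of its children is a candidate for a 403 or an empty page.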

Disk usage

July 6, 2009

Providing enough free disk space is one of the most urgent tasks for every system administrator. Applications often behave strangely when a disk runs out of free space, and sometimes data gets corrupted if the application is badly written.

Under Unix, the two tools for determining disk usage are “du” (disk usage) and “df” (disk free). With “du” you can determine the amount of space a certain directory (with the files and subdirectories in it) consumes. “df” shows, on a global level, the amount of used and free disk space per partition.

Suppose you give the same directory (which is also a mount point) to both “du” and “df” as a parameter. This directory contains a full CQ application with content and versions. You will probably get different results. When we did it, “df” showed about 570 gigabytes of used disk space, but “du” claimed that only 480 gigabytes were used. We checked for hidden directories, open files, snapshots and other things, but the difference of about 90 gigabytes remained.

This phenomenon can be explained quite easily. “du” accumulates the sizes of files. So if a file is 120 bytes in size, it adds 120 to the sum. “df” behaves differently: it counts in blocks, which are the smallest allocation unit of a Unix filesystem (the block size depends on the filesystem; 512 bytes is assumed here). Because a block can hold the data of only one file, the 120-byte file uses a full block, leaving 392 bytes of that block unused.
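The arithmetic behind this can be written down in a few lines. This is a simplified model, not a description of any particular filesystem: the function names are invented for the illustration, the block size of 512 bytes is just the value used in the example above, and features such as inline data or tail packing are ignored.

```python
def allocated_size(file_size, block_size=512):
    """Bytes occupied on disk if space is handed out in whole blocks only."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def slack(file_size, block_size=512):
    """Bytes inside the last block that the file does not use."""
    return allocated_size(file_size, block_size) - file_size


if __name__ == "__main__":
    print(allocated_size(120), slack(120))
```

For the 120-byte file this gives one full block of 512 bytes and 392 unused bytes; with a larger block size the unused part per small file grows accordingly.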

Usually this behaviour is not apparent, because the number of files is rather small (a few to some tens of thousands) and the files are large, so the unused part is at most 1 percent of the total file size. But if you have a CQ contentbus with several hundred thousand files (content + versions), many of them small, this part can grow to a level where you’d ask your system administrator where the storage has gone.

So, dear system administrator, there’s no need to move your private backup off the server; just tell the business that their unique content structure needs a lot of additional disk space. :-)
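To get a rough idea of how much space a directory tree loses this way, you can sum the plain file sizes and the block-rounded sizes side by side. The sketch below is an estimate under assumptions, not a replacement for “du” or “df”: `tree_usage` is a hypothetical helper, the block size is a parameter you have to set to match your filesystem, and directories, metadata, sparse files and hard links are not accounted for.

```python
import os


def tree_usage(root, block_size=512):
    """Return (apparent_bytes, block_rounded_bytes) for regular files below root."""
    apparent = 0
    rounded = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            size = os.path.getsize(full)
            apparent += size
            # Round each file up to a whole number of blocks.
            rounded += -(-size // block_size) * block_size
    return apparent, rounded


if __name__ == "__main__":
    apparent, rounded = tree_usage(".")
    print("sum of file sizes:   ", apparent)
    print("rounded up to blocks:", rounded)
    print("unused in last blocks:", rounded - apparent)
```

The more small files the tree contains, the larger the gap between the two numbers becomes.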