Caching Discussion

Introduction

First of all, it is worth asking exactly why we might want to use caching. The basic answer is speed, and thus performance: more hits per second, or more concurrent hits. In a system like Paloose (and Tomcat/Cocoon) a performance increase can also come from alternative page-serving techniques. In these, a static page server such as Apache is used (indeed it is essential) to take the burden of static content away from Paloose, leaving Paloose to deal with the more dynamic pages. Images, for example, are good candidates for bypassing Paloose and being served directly by Apache.

So what distinguishes a dynamic page from a static page? Simplistically, a dynamic page involves:

resources that change (a database query for example), or
resources that are dependent on user input, or
resources that are made up from fragments (transformed from XML documents)

Images (JPEG and PNG files, etc.) are, in general, static because they do not require any modification in response to user input or database queries, and thus can be safely served by the Apache server.

Anything, therefore, that can speed up the process of transforming user data and other input variables has to be good. As a further complication, because of the stateless, on-demand nature of the servers, there is also the question of the server code itself. In Paloose this is made worse by the use of PHP5, which is an interpreted language. Cocoon and Tomcat use a language that is compiled into intermediate code (not wholly accurate, but true for the purposes of the argument). Apache is a compiled solution and so runs without these restrictions. In an interpreted language such as PHP5 the code has to be translated into a runnable form each time. In modern systems there is a natural caching (persistence) which helps with this process: frequently used code is kept in memory for use next time. However, it is impossible to rely on this being there all the time.

Caching Code

Caching the code is the primary way of overcoming the problems of interpreters. With Paloose this would clearly be possible. I have not tried it, but it remains a potential option for future work. I have shown elsewhere that judicious control of the basic Paloose code (rather than caching it) can give a considerable performance increase. However, this comes at the cost of code clarity (no comments) and missing functionality (no logging). While the former may be acceptable, the latter probably is not.

Caching the Sitemap

Paloose works by interpreting the sitemap and building an internal structure of Paloose components and pipelines that represents it. Unfortunately this is done each time a request is made, which gives a substantial performance penalty. One solution that I tried was to precompile the sitemap into a PHP5 representation which is then run (and can be cached). Curiously, the increase in performance was not as great as that gained from the changes to the Paloose code. One advantage of this technique is that it is not dependent on user input, changing XML pages or database queries.

Caching the Page Components

Within the pipelines, components take a variety of resources to make up the final deliverable page. These inputs are various:

the input query from the user which consists of the requested resource (the page) and the query string of parameters.
data as a result of database queries.
the XML or text fragments that make up the page.

The pipeline can be considered to be a state machine whose output depends directly on its input. Unlike most (useful) state machines it has no persistent state between requests. Each request is considered to be fresh. The server/client arrangement with Web pages gets around this by using cookies. However, this solution is not available in all cases (not everyone has cookies enabled). Thus we need to characterise a request purely on the basis of the input conditions for that request.
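One way to characterise a request purely by its inputs is to reduce the requested resource and its query string to a single hash. The following is only a minimal sketch in Python, used for illustration rather than Paloose's own PHP5; the function name `request_key` and the choice of SHA-256 are assumptions, not part of Paloose.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode


def request_key(resource: str, query_string: str) -> str:
    """Return a stable hash that identifies a request by its inputs alone.

    Hypothetical helper for illustration; not a Paloose function.
    """
    # Sort the parameters so that "a=1&b=2" and "b=2&a=1" give the same key.
    params = sorted(parse_qsl(query_string, keep_blank_values=True))
    canonical = resource + "?" + urlencode(params)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests for the same page with the same parameters then map to the same key whatever order the parameters arrive in, while any change to a parameter value gives a different key.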

On top of the problems of changing inputs, the state of the server depends on:

The sitemap (has it changed since the last request?): we do not need to check this, as a change here will cause all the other conditions to fail. If they do not fail then the cached data can be safely used.
The XML fragments (have they changed?): the most efficient way of checking this is to note the latest modification time of the XML file and compare it with the previous one. The previous time will, however, have to be held between requests in some form. Nor is this the complete story, since some pipeline elements in the sitemap do not take an external file (a transformer, for example). In this case we need to inspect the input DOM, which is a little more tricky.
The XSL transformation file (has it been changed?): again, the most efficient way of doing this is via a timestamp.
The query string submitted with the request (how have these parameters changed?): we cannot use a timestamp here, and so some form of hash is required.
Response from an SQL database query (how may the results have changed?): things that are dependent on these types of external queries obviously cannot be cached reliably, as they are outside our control.
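The timestamp checks in this list can be sketched as follows. This is an illustration in Python under stated assumptions, not Paloose's implementation: the class `FragmentCache` and its methods are hypothetical, the key is assumed to be some hash of the request, and the entries live in memory, whereas a stateless server would have to hold them between requests in a file or similar store.

```python
import os
from typing import Dict, List, Optional, Tuple


class FragmentCache:
    """Cache rendered output together with the modification times of its sources.

    Hypothetical class for illustration; not part of Paloose.
    """

    def __init__(self) -> None:
        # key -> (recorded mtimes of the source files, cached output)
        self._entries: Dict[str, Tuple[Dict[str, int], str]] = {}

    def store(self, key: str, source_paths: List[str], output: str) -> None:
        """Record the output and the current mtime of every source file."""
        mtimes = {path: os.stat(path).st_mtime_ns for path in source_paths}
        self._entries[key] = (mtimes, output)

    def fetch(self, key: str) -> Optional[str]:
        """Return the cached output, or None if any source file has changed."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        mtimes, output = entry
        for path, recorded in mtimes.items():
            try:
                if os.stat(path).st_mtime_ns != recorded:
                    return None  # source modified since the output was cached
            except OSError:
                return None  # source missing or unreadable: treat as stale
        return output
```

Only the files listed in `source_paths` (the XML fragment and the XSL file, say) are checked, so a change to a file that one of them includes would go unnoticed.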

Checks and Balances

Adding a caching system introduces extra code, and the cost of running this extra code can offset the advantages that might be gained from the cache. Careful testing is therefore needed when deciding whether to use a cache system.
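A simple way to test this is to time the same request with and without the cache and compare the averages. The sketch below is in Python for illustration only; `mean_seconds` and `cache_is_worthwhile` are hypothetical helpers, and the two callables stand in for whatever renders a page with the cache off and on.

```python
import time
from typing import Callable


def mean_seconds(render: Callable[[], object], repeats: int = 100) -> float:
    """Average wall-clock time of one call to render() over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        render()
    return (time.perf_counter() - start) / repeats


def cache_is_worthwhile(uncached: Callable[[], object],
                        cached: Callable[[], object],
                        repeats: int = 100) -> bool:
    """True if the cached path, including its own checks, is faster on average."""
    return mean_seconds(cached, repeats) < mean_seconds(uncached, repeats)
```

The cached callable should include the cost of the cache checks themselves (hashing, timestamp comparisons), since that is exactly the overhead being weighed.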

Data caches in the Paloose pipeline might (and I only say might) be of benefit where external influences are greatest: for example if the XSL transformations are particularly large or there are many of them.

Warning

One final warning. Because the caching mechanism does not pick up all the changes in the source files (for example included XML or XSL files), it is important to turn off the caching when developing Paloose sites. It is very easy to make changes and not see them reflected in the HTML output. I have learned this the hard way. I also suggest that when a change has been made to a live site the cache be cleared to allow the new changes to filter through.