Discussion and considerations

Relational databases provide an effective and readily available means to store data. Once established, they normally behave consistently and reliably, making them easier to use than file systems. And clearly a database can do much more than a simple file system!

Efficiency can quickly become an issue, both in relation to how often requests are made to a database, and how long queries take. One way to offset the cost of database queries is to use a cache at some stage in the processing. Using a cache is a technique that crops up frequently in this book, and is detailed in Chapter 6,Caches and Handlers.

Whatever the framework does, a major factor will always be the care developers of extensions take over the design of table structures and software; the construction of SQL can also make a big difference. Examples included here have been assiduously optimized so far as the author is capable, although suggestions for further improvement are always welcome!

Web applications are typically much less mature than more traditional data processing systems. This stems from factors such as speed of development and deployment. Also, techniques that are effective for programs that run for a relatively long time do not make sense for the brief processing that is applied to a single website request. For example, although PHP allows persistent database connections, thereby reducing the cost of making a fresh connection for each request, it is generally considered unwise to use this option because it is liable to create large numbers of dormant processes, and slow down database operations excessively.

Likewise, prepared statements have advantages for performance and possibly security but are more laborious to implement. So, the advantages are diluted in a situation where a statement cannot be used more than once. Perhaps, even more than performance, security is an issue for web development, and there are well known routes for attacking databases. They need to be carefully blocked.

The primary goal of a framework is to make further development easy. Writing web software frequently involves the same patterns of database access, and a framework can help a lot by implementing methods at a higher level than the basic PHP database access functions.

In an ideal world, an object-oriented system is developed entirely on the basis of OO principles. But if no attention is paid to how the objects will be stored, problems arise. An object database has obvious appeal, but for a variety of reasons, such databases are not widely used. Web applications have to be pragmatic, and so the aim pursued here is the creation of database designs that occasionally ignore strict relational principles, and objects that are sometimes simpler than idealized designs might suggest. The benefit of making these compromises is that it becomes practical to achieve a useful correspondence between database rows and PHP objects.

It is possible that PHP Data Objects (PDO) will become very important in this area, but it is a relatively new development. Use of PDO is likely to pick up gradually as it becomes more commonly found in typical web hosting, and as developers get to understand what it can offer. For the time being, the safest approach seems to be for the framework to provide classes on which effective data objects can be built. A great deal can be achieved using this technique, as will soon be illustrated.

Database dependency

Lest this section create too much disappointment, let me say at the outset that this book does not provide any help with achieving database independence. The best that can be done here is to explain why not, and what can be done to limit dependency.

Nowadays, the most popular kind of database employs the relational model. All relational database systems implement the same theoretical principles, and even use more or less the same structured query language. People use products from different vendors for an immense variety of reasons, some better than others. For web development, MySQL is very widely available, although PostgreSQL is another highly regarded database system that is available without cost. There are a number of well-known proprietary systems, and existing databases often contain valuable information, which motivates attempts to link them into CMS implementations.

In this situation, there are frequent requests for web software to become database independent. There are, sadly, practical obstacles towards achieving this.

It is conceptually simple to provide the mechanics of access to a variety of different database systems, although the work involved is laborious. The result can be cumbersome, too. But the biggest problem is that SQL statements are inclined to vary across different systems.

It is easy in theory to assert that only the common core of SQL that works on all database systems should be used. The serious obstacle here is that very few developers are knowledgeable about what comprises the common core. The majority of books and training courses on SQL are about a particular database system. ANSI SQL might be thought to provide a system neutral language, but then not all of ANSI SQL is implemented by every system. So, the fact is that developers become expert in one particular database system, or at best a handful.

Skilled developers are conscious of the standardization issue, and where there is a choice, they will prefer to write according to standards. For example, it is better to write:

 SELECT username, userid, count(userid) AS number FROM aliro_session AS s INNER JOIN aliro_session_data AS d ON s.session_id = d.session_id WHERE isadmin = 0 GROUP BY userid

rather than,

 SELECT username, userid, count(userid) AS number FROM aliro_session AS s, aliro_session_data AS d WHERE s.session_id = d.session_id AND isadmin = 0 GROUP BY userid

This is because it makes the nature of the query clearer, and also because it is less vulnerable to detailed syntax variations across database systems.

Use of extensions that are only available in some database systems is a major problem for query standardization. Again, it is easy while theorizing to deplore the use of non-standard extensions. In practice, some of them are so tempting that few developers resist them.

An older MySQL extension was the REPLACE command, which would either insert or update data depending on whether a matching key was already present in the database. This is now discouraged on the grounds that it achieved its result by deleting any matching data before doing an insertion. This can have adverse effects on linked foreign keys but the newer option of the INSERT ... ON DUPLICATE KEY construction provides a very neat, efficient way to handle the case where data needs to go into the database allowing for what is already there. It is more efficient in every way than trying to read before choosing between INSERT and UPDATE, and also avoids the issue of needing a transaction.

Similarly, there is no standard way to obtain a slice of a result set, for example starting with the eleventh item, and comprising the next ten items. Yet this is exactly the operation that is needed to efficiently populate the second page of a list of items, ten per page. The MySQL extension that offers LIMIT and LIMITSTART is ideal for this purpose.

Because of these practical issues, independence of database systems remains a desirable goal that is rarely fully achieved. The most practical policy seems to avoid dependencies where this is possible at reasonable cost.

The role of the database

We already noted that a database can be thought of as uncontrolled global data, assuming the database connection is generally available. So there should be policies on database access to prevent this becoming a liability.

One policy adopted by Aliro is to use two distinct databases. The "core" database is reserved for tables that are needed by the basic framework of the CMS. Other tables, including those created by extensions to the CMS framework, use the "general" database.

Although it is difficult to enforce restrictions, one policy that is immediately attractive is that the core database should never be accessed by extensions. How data is stored is an implementation matter for the various classes that make up the framework, and a selection of public methods should make up the public interface. Confining access to those public methods that constitute the API for the framework leaves open the possibility of development of the internal mechanisms with little or no change to the API. If the framework does not provide the information needed by extensions, then its API needs further development. The solution should not be direct access to the core database.

Much the same applies to the general database, except that it may contain tables that are intended to be part of an API. By and large, extensions should restrict their database operations to their own tables, and provide object methods to implement interfaces across extensions. This is especially so for write operations, but should usually apply to all database operations.

Level of database abstraction

There have been some clues earlier in this chapter, but it is worth squarely addressing the question of how far the CMS database classes should go in insulating other classes from the database.

All of the discussions here are based on the idea that currently the best available style of development is object oriented. But we have already decided that using a true object database is not usually a practical option for web development. The next option to consider is building a layer to provide an object-relational transformation, so that outside of the database classes, nobody needs to deal with purely relational concepts or with SQL. An example of a framework that does this is Propel, which can be found at http://propel.phpdb.org/trac/.

While developments of this kind are interesting and attractive in principle, I am not convinced that they provide an acceptable level of performance and flexibility for current CMS developments. There can be severe overheads on object-relational operations and manual intervention is likely to be necessary if high performance is a goal. For that reason, it seems that for some while yet, CMS developments will be based on more direct use of a relational database.

Another complicating factor is the limitations of PHP in respect of static methods, which are obliged to operate within the environment of the class in which they are declared, irrespective of the class that was invoked in the call. This constraint is lifted in PHP 5.3 but at the time of writing, reliance on PHP 5.3 would be premature, software that has not yet found its way into most stable software distributions. With more flexibility in the use of static methods and properties, it would be possible to create a better framework of database-related properties.

Given what is currently practical, and given experience of what is actually useful in the development of applications to run within a CMS framework, the realistic goals are as follows:

  • To create a database object that connects, possibly through a choice of different connectors, to a particular database and provides the ability to run SQL queries
  • To enable the creation of objects that correspond to database rows and have the ability to load themselves with data or to store themselves in the database

Some operations, such as the update of a single row, are best achieved through the use of a database row object. Others, such as deletion, are often applied to a number of rows, chosen from a list by the user, and are best effected through a SQL query.

Towards the end of this chapter, the automatic creation of HTML is mentioned briefly. Such a thing is possible, and you can obtain powerful code for achieving it by downloading the full Aliro project. Unfortunately, experience in use has been disappointing. Often, so much customization of the automated code is required that the gains are nullified, and the automation becomes just an overhead. This topic is therefore given little emphasis.

Ease of development

Let's think once more of the typical pattern that occurs time and time again in web applications: a list of objects, some of whose properties are displayed through the browser, with page control. This is most effectively achieved by running a query to find out for page control purposes the total number of items to be displayed, and then running a second query to find the items needed for display on a particular page.

A CMS framework should provide an easy means to carry out operations of this kind. Finding the total number of items should be achievable in one or two method calls, and getting back a set of objects is so common an operation that it should be achievable in a single method call. By default, the simple PHP stdClass is acceptable for the return of rows, but being able to return the data as arbitrary objects makes the data handling much more powerful, since the developer is then able to create data objects with behaviors that model the problem domain.

A simple but significant issue is being able to prefix table names in the database so as to avoid clashes between systems sharing the same database. The framework should be designed to handle this.

Major gains can be made if the simple properties of data objects are determined from the database instead of having to be coded. This cuts down on coding, and avoids a source of errors and inconsistencies. Going further, data objects can have the SQL for basic operations generated automatically, reducing development effort and eliminating a source of errors.

The CMS framework can provide an acceptable solution to the problem of keeping certain items in order. For example, a menu has a particular order that may be altered by a site administrator. Relational databases do not handle an imposed ordering easily. They typically work on the basis that order arises naturally out of the data, and can be achieved by sorting. This is true, but when a specific ordering is imposed by numbering, say menu items, repeated changes lead to a situation where the numbers need to be resequenced. There is no direct way to do this using a relational database because all operations are done on result sets rather than being done sequentially. A method of solving this problem with reasonable efficiency for all but the very largest sets of data is provided in the Sequencing database rows section towards the end of this chapter.

Keeping up with change

Very few systems are static, especially when they are web related. A lone site can be maintained by editing its various components. But a CMS framework is justified by situations where multiple sites use more or less the same software. Questions of efficient maintenance quickly arise.

The CMS architecture advocated here involves a minimal framework, with much of the functionality added as extensions to the basic framework. To make this easy, the framework includes an installer, and extensions have to be packaged in a particular way to support the work of the installer.

Inevitably, the time will come when a new version of an extension needs to be installed. The framework helps in this by providing three distinct functions such as installation, deletion, and upgrade. The upgrade procedure is exactly the same as installation except that a box is ticked to tell the installer to expect the extension to be already present. Whenever it handles an installation or upgrade, the installer invokes an optional piece of code in the extension, the install or upgrade code. These two may well have some very similar functions, which suggest that the best implementation could involve a common abstract parent class.

So one question we need to answer in this chapter is what database facilities are needed to make the task of an extension's upgrade code easier? The answer, in general, is to make it simple for the extension to obtain information about the current database. That involves providing methods for finding out whether a particular table is presently installed, and whether a table contains a particular field. It also involves being able to add a field if it does not exist, or altering its specification if it does.

While these functions could be implemented using more basic building blocks, they are sufficiently frequently needed that it makes sense for them to be a basic part of the CMS framework.

Database security

Websites get cracked. Databases are critical to CMS operation, and malicious exploits are often directed at CMS databases.

A CMS needs to connect to its database as a "user", who has quite extensive privileges. As we will discuss later in this chapter and also in Chapter 4, Administrators, Users, and Guests, some progress can be made to limit this vulnerability by providing the option of using two databases: one for the most critical information, and another for more general information. But in both cases, the connection to the database must have the ability to perform all possible actions within that database, otherwise, the concept of a CMS that can be managed through its own web interface is defeated.

As far as possible, the CMS should keep secret the credentials for connecting to the database. More can be achieved in this direction with a custom-built framework than one which is released for general use. In the general case, we know very little about the hosting environment and all the code is accessible to the public, including those with ill intent. This means that schemes such as encryption are of limited use, since a cracker may be able to use the CMS's own decryption routines to obtain access to hidden data.

The database itself should be secured by restricting where connections can come from. Usually, it is feasible to limit database access to one or more servers that are known to be serving web pages, and refuse access to any other client. Constraints of this kind are important, but outside the remit of the CMS.

The best known vulnerability that does fall within the scope of the CMS is SQL injection. This is easily illustrated by thinking of a community site where each user has a profile that can be accessed using that user's chosen name. We might create SQL like the following:

 $sql = "SELECT realname, profile FROM #_users WHERE username='$name' ";

The problem is that the PHP variable $name could have been taken directly from something typed into a browser. If it is then immediately placed into SQL, an ingenious cracker is able to pervert the SQL statement. Typing a name followed by a quote mark followed by some SQL is liable to cause a much modified SQL statement to be run. With knowledge of the system and a certain amount of ingenuity, it may be possible to create a situation where the SQL actually returns the password of a user rather than harmless profile information.

The only fully effective solution to this problem has to be implemented by the software that uses the CMS database methods. It is essential to carefully validate user input before putting it into SQL. Where quite specific validation conditions are relevant, such as a specific range of legal values, they should be applied rigidly.

Otherwise, one of the following two options should be taken:

  • A numeric field can be submitted to SQL without quotes, but should be validated to be strictly numeric before this is done.
  • Any non-numeric field must be enclosed in quotes, and the CMS should provide an "escape" method that will process strings to make sure they can be legitimately enclosed in quotes to create a string that the database can handle. The CMS escape method should call a function or method provided by the database whenever possible. MySQL does provide this service.

In some circumstances, it is also important to check user data for semicolons, since they are SQL statement terminators. A malicious user may be able to include additional SQL by the use of semicolons. This does not arise in most of the database methods discussed here, since they will not accept multiple SQL statements; the result will simply be a database error.

Another kind of attack is to try to break the site. One way this might happen is when LIKE is used in SQL. Thinking again of the same community site, we might want to help people find a user profile by accepting a partial name, and then running a query something like:

 $sql = "SELECT username FROM #_users WHERE username LIKE ('$name%')";

If we have a very large number of users, we may be relying on there being an index on the username field to make the query run efficiently. But supposing the malicious user puts in a name that begins with a percent sign, the result is a search that cannot use the index, and may take much longer than expected. This particular case is possibly not so bad, but situations can be envisaged where the insertion of an unexpected percent sign would create a query that would effectively bring the database server to a standstill.

The general conclusion is that all database queries have to be written with an eye on how they might be damaged by a malicious user. There is no substitute for strict validation, and it is not always easy for developers to adopt the mindset of a cracker. Unfortunately, the world of the Internet is such that they have to!

Pragmatic error handling

Software developers have always been uncertain of how best to handle errors. The writers of database systems are no different, so they simply pass the buck back to the database users. Typically, erroneous SQL simply doesn't return any results. An error code will also be set, but if it is not checked by the caller, the database system does not care.

Application developers commonly veer between two extremes, sometimes within the same program. One extreme is to check after every database operation to see whether any error occurred, and then to report on an error, possibly terminating processing. The opposite extreme is to ignore the possibility of errors altogether, and just let the program proceed. This uncertainty on the part of developers is well illustrated by a comment buried in the code of a very popular CMS that says "This is an error but not sure what to do with it ... we'll still work for now"!

A fairly common experience with websites is to attempt some apparently straightforward operation only to receive a database error message that is completely meaningless to you, as an outsider to the system. Because this is so pointless and even irritating to the user, a lot of development is done using the second choice, ignoring all errors. This makes the code much cleaner as checking after every SQL operation is cumbersome. In a surprising number of cases the result is not especially unsatisfactory for the user, and in others the website malfunctions in some arbitrary way.

It is this uncertainty that is worrying, and causes developers to at least feel guilty about not handling errors. A further complicating factor is that detailed error messages shown to a website visitor may be useful to a cracker looking to understand the structure of the database for illicit purposes. This is less relevant for open source, since most of the information is public in any case, but could be an important consideration for a system that is closed source, or a customized version of open source.

The best solution to this dilemma seems to be to check every database operation, and to use the PHP5 exception handling mechanisms to deal with any errors. Following this policy, when an error is encountered, an exception is always thrown. This is usually picked up at quite a general level, and a good deal of information can be recorded into an error table in the database. This includes information about the error, the SQL itself, and a trace of the calls that led to the running of the SQL

statement. The latter is vital since very similar SQL statements may result from quite different processing, or for that matter, the person looking to resolve the problem may not be familiar with all the SQL statements in the system. A general error message is shown to the user, and the rest of the processing is stopped. Telling the user only that a database query has failed is significant for security as just discussed. The fact of an error at a particular time and date is enough for the site manager to consult the error log, and find out the details.

This approach provides the user with adequate information that a failure has occurred, and is generally a safe option for preventing further damage to the site. For the site administrator, extensive debugging information is available so that the fault can be diagnosed, and corrected. In fact, the details of the error that can be recorded into a logging system will generally be more complete and more accurate than problem reports from site visitors.

If a developer chooses to do more to handle errors, this is still possible. Since all that the standard database methods do is to create an exception, the developer is free to catch exceptions occurring in particular sections of code instead of leaving the framework to pick them up at a general level.