Saturday, April 30, 2011

Databases:Chapter 10


Chapter 10

Databases





CONTENTS


Databases have been intimately connected with the World Wide Web and CGI ever since the inception of the Web. In fact, the Web itself is an immense worldwide database, a collection of data and resources accessible at the click of a mouse.
On a more mundane level, interaction with server-side databases is one of the most natural applications of CGI. The end user can submit a query through a form and have the results displayed directly back to his or her browser.
Because real-time interaction is not usually needed for database interaction, one of the major drawbacks of CGI (lack of persistent connection) is avoided. In addition, because only the results of the query are sent to the client, the size of the database does not factor greatly in the speed of the transaction. (Anyone who has used one of the Web searchers such as AltaVista or Lycos can attest to this.)
Although any kind of database can be accessed via CGI, there are a few types of databases that have become very popular on the Web:
  • Small Text Databases-These are the easiest to create and are useful for many small- to medium-sized tasks. These databases are simply ASCII files with delimited or fixed-length records. Address books, schedules, and other such databases with limited number of entries are suited well for this method.
  • Databases of the Web-A natural candidate for Web databases is the Web itself. Services that catalog large portions of the Web (along with Usenet, Gopher, and so on) are popping up with great frequency, as covered elsewhere in this book. This can also be applied to intranets and single machines. Indexing software like freeWais or Ice can create a database of an entire site, which can then be accessed through CGI programs.
  • Large Database Servers-Large databases are generally stored in a database server of some sort (even if the server is also a client, such as most common Windows-based databases). As long as the server has some method of interface with other programs, the information in these databases can be accessed from a CGI program.
In practice, Web-based databases can be some combination of these. Each database has its own needs, and CGI programs must often be customized to suit the needs of your particular database.

Database Interfaces

To effectively and seamlessly merge your database with the Web, CGIs must be used in both the front and back end of the database interaction.

Front End CGIs

The first thing that must be considered is how the user is going to enter queries to the database. An HTML form is the most common way for the user to submit information, although there are other ways. As an example, consider an interface to an address book. A simple form could look like this:
<HTML><HEAD><TITLE>My Address Book</title></head>
<BODY>
<H2>Welcome to my address book</h2>
To find addresses that match a certain category, fill in that category and
then press 'submit'.
<FORM ACTION="address.cgi" METHOD="POST">
Name: <INPUT SIZE=45 name="name"><br>
Phone: <INPUT SIZE=45 name="phone"><br>
Street Address: <INPUT SIZE=45 name="street"><BR>
City: <INPUT SIZE=20 name="city"> State: <INPUT SIZE=3 name="state">
Zip: <INPUT SIZE=6 name="zip"><br>
<INPUT TYPE=SUBMIT Value="  Submit Query  ">
<INPUT TYPE=RESET Value="  Reset Form  "><br>
</body></html>
This form calls the CGI program address.cgi. This is the front end to the database interaction. The purpose of the front end is to collect the data from the form, parse it, and somehow pass the query to the database. If the database is an ASCII text file, the front end is also the middle end and the back end. It must do the searching, interpret the data, and then pass the results back to client. For database servers (including Web indexes like freeWais and Ice), the front end must put the query into a form that the server will understand, and then pass the query to the server. A back end CGI program must then retrieve the results and pass them to the user. Very often in this case, the front and back ends are contained in the same program. On systems that support it (UNIX, Amiga, and others), this can be accomplished with process forking. In the Windows environment, special applications that take advantage of OLE or some other type of inter-application communication is necessary.
To go back to the address book example, we can now construct the CGI program that will answer the client's request.
First, we need to know the format of the database itself. A small text-based format is sufficient for our needs. We'll use delimited records, although fixed-length records would also work. An example record follows:
0:Elmer J. Fudd:555-1234:42 Jones Lane:Chuckville:CA:90210
This format will be familiar to anyone who has ever seen a UNIX password file. There are two drawbacks to this format. The total of all fields cannot exceed any line length limitations on whatever system you are using (in our case, this should not be a problem). Also, our delimiter (a colon) should not appear in any field, or it will look like the start of a new field. In an ideal world, this should not be a problem for us (unless someone lived in "New York: The City"). But in reality, people make typos or are just plain malicious. Therefore, we must be aware of this potential problem.
Now that we know the form of the database, we can begin the CGI program to gather the information from the form. Any language can be used to write CGIs, but in this example, we'll use Perl for its text-handling capabilities.
#!/bin/perl

require cgi_head; # Get form data and print header.

Tip
In all Perl CGIs in this chapter, a module called cgi_head.pm is used to gather the information from the form and print the required HTML header. This module places a form entry with name 'foo' into an associative array entry with name $FORM{'foo'}. There are several freely available programs for several languages to accomplish this, including CGI.pm for Perl at
http://www.perl.com/perl/CPAN/

Now that form data has been read in, we must read in the database itself. Because we are using a delimited database, it is easiest to read in the entire database. A fixed-length field database would enable us to move through the database without reading the entire thing, but that method has its own drawbacks (the most obvious being that the records must not exceed a fixed length). We read in the database as a flat ASCII file and parse it line by line using the handy Perl while(<FILEHANDLE>)<> construct.

Caution
This example required Perl 5.001 or above because of its use of references that were not included in Perl 4 (or earlier). Perl 5 contains many enhancements and new features and is a must for any Perlphile. It is available at
http://www.perl.com/perl/CPAN/

# First, open the database. (Which is called 'database.txt' here.)
open (DAT, "database.txt") || die "Can't open the database: $! !.\n";
$maxn = 0; # A counter for the number of entries.
while (<DAT>) {
    chop;
    @field = split(/:/); # Split the line into the data fields.
    $n = $field[0]; # First field is an id number
    $add[$n]{'name'} = $field[1]; # Then the name
    $add[$n]{'phone'} = $field[2]; # The phone number
    $add[$n]{'street'} = $field[3]; # The street address
    $add[$n]{'city'} = $field[4]; # The city
    $add[$n]{'state'} = $field[5]; # The state
    $add[$n]{'zip'} = $field[6]; # The Zip Code
}
$maxn = $n # Set the max number to the last entry
Now that the database has been loaded, we need to compare the user's query with the data, as shown in Listing 10.1.

Listing 10.1. Searching an ASCII database.
@results = (); # Zero out an array to hold the results.

if ($name = $FORM{'name'}) { # If the client wanted to search a name,
    for ($I = 0; $I <= $maxn; $I++) { # Go through each entry
        if ($name eq $add[$I]{'name'}) { # Looking for a match.
            push(@results,$I); # If one is found, add its id
        }               # Number to the list of results.
    }
    if (!@results) { &exitnone; }  # If no match is found, exit.
}

# Now repeat for each criteria.  If there are results from a previous
# match, search them instead, and remove any entries that don't match.

if (($phone = $FORM{'phone'}) && !@results) {
    for ($I = 0; $I <= $maxn; $I++) {
        if ($phone eq $add[$I]{'phone'}) {
            push(@results,$I);
        }
    }
    if (!@results) { &exitnone; }
} elsif ($phone = $FORM{'phone'}) {
    @r2 = @results;
    foreach $I (@r2) {
        if ($phone ne $add[$I]{'phone'}) {
            @results = grep(!/$I/,@results);
        }
    }
    if (!@results) { &exitnone; }
}

if (($street = $FORM{'street'}0 && !@results) {
    for ($I = 0; $I <= $maxn; $I++) {
        if ($street eq $add[$I]{'street'}) {
            push(@results,$I);
        }
    }
    if (!@results) { &exitnone; }
} elsif ($street = $FORM{'street'}) {
    @r2 = @results;
    foreach $I (@r2) {
        if ($street ne $add[$I]{'street'}) {
            @results = grep(!/$I/,@results);
        }
    }
    if (!@results) { &exitnone; }
}

if (($city = $FORM{'city'}) && !@results) {
    for ($I = 0; $I <= $maxn; $I++) {
        if ($city eq $add[$I]{'city'}) {
            push(@results,$I);
        }
    }
    if (!@results) { &exitnone; }
} elsif ($city = $FORM{'city'}) {
    @r2 = @results;
    foreach $I (@r2) {
        if ($city ne $add[$I]{'city'}) {
            @results = grep(!/$I/,@results);
        }
    }
    if (!@results) { &exitnone; }
}

if (($state = $FORM{'state'}) && !@results) {
    for ($I = 0; $I <= $maxn; $I++) {
        if ($state eq $add[$I]{'state'}) {
            push(@results,$I);
        }
    }
    if (!@results) { &exitnone; }
} elsif ($state = $FORM{'state'}) {
    @r2 = @results;
    foreach $I (@r2) {
        if ($state ne $add[$I]{'state'}) {
            @results = grep(!/$I/,@results);
        }
    }
    if (!@results) { &exitnone; }
}

if (($zip = $FORM{'zip'}) && !@results) {
    for ($I = 0; $I <= $maxn; $I++) {
        if ($zip eq $add[$I]{'zip'}) {
            push(@results,$I);
        }
    }
    if (!@results) { &exitnone; }
} elsif ($zip = $FORM{'zip'}) {
    @r2 = @results;
    foreach $I (@r2) {
        if ($zip ne $add[$I]{'zip'}) {
            @results = grep(!/$I/,@results);
        }
    }
    if (!@results) { &exitnone; }
}

At this point, either we have successful matches that are stored in the array @results, or we have no matches, in which case we call the &exitnone subroutine. Now we can give the client the results (or lack thereof).
# If there are no matches, print a note then die.

sub exitnone {
    print <<EOE;
<HTML><HEAD><TITLE>No matches</title></head>
<BODY>

<h3>There were no matches that fit your criteria.</h3>
<A HREF="addrbk.html">Go</a> back to the form to try again.
</body></html>
EOE

die;
}

# Print all the fields of each match.

print <<EOP;
<HTML><HEAD><TITLE>Search Results</title></head>
<BODY>
<h3>The entries that matched your search</h3>
<pre>
EOP

foreach $r (@results) {
    print <<EOG;

----
Name: $add[$r]{'name'}
Phone: $add[$r]{'phone'}
Address:
$add[$r]{'street'}
$add[$r]{'city'}, $add[$r]{'state'}  $add[$r]{'zip'}

EOG

}
print <<EOH;
</pre><br>
Thank you for using my address book.
<A HREF="addrbk.html">Go</a> back to the form to make another search.
</body></html>
EOH
Now we have a working front end to an address book. There are several optimizations that could be made, but it runs quite well for a few dozen lines. Note that this script, as is, only does a Boolean AND search on all fields. It would be possible to make it an OR search by removing all calls to &exitnone except for the last one. This way, when the program does not find any matches, it will not die but move on to the next field. It would also be possible to enable the end user to choose whether to do an AND or OR search by adding a pull-down menu to the form page. Then the CGI could exit or not depending on the choice.
Now that the user can search your database for any number of criteria; the next logical question is how to add or remove information to the database. You could, of course, do this by hand, but it would be advantageous to allow direct manipulation of the database from the Web itself. Fortunately, this is not hard.
In the manipulation of the database itself is where the difference between delimited and fixed-length record databases becomes important. With delimited text, you have no easy way of knowing where one record ends and another begins. Therefore, to change or delete one record it's necessary to rewrite the entire database. So in small databases, this is not really a big performance hit. If your database is large enough that this becomes a problem, it would probably be a good idea to look into using a database server.
With fixed-width field databases, however, it's not necessary to rewrite the entire database to change a record. Because the length of each record is known, functions like seek() and tell() (or their equivalent in your preferred language) can be used to write over only a portion of the file, changing or deleting records.

Caution
Enabling users to write to files is one of the most dangerous undertakings on the Web. Most Web servers are run as user 'nobody' (on systems that have distinct users). This means that the server has no special permissions to write to any file. To be accessible to a CGI script, a file must be world writeable, meaning that anyone with access to the server machine can modify the file in any way (including erasing it entirely). If you trust everyone on your machine (or you are the only user), this may not be a terrible problem. Because the name of the database file is not visible from the Web, you could hide it in some far-out directory with an unusual name, thereby providing "security by obscurity." There are other solutions, however. A module exists for Perl called CGIWrap (http://wwwcgi.umr.edu/~cgiwrap), and similar modules exist for other languages. CGIWrap will execute your CGI program "setuid owner." That is, the program runs as if it were executed by the user who owns the program. This allows you to remove write privileges for everyone but yourself. Be aware, however, that the program can now modify any file in your directory as if it were you. Therefore, it is wise to make very sure that your programs are secure. (For Perl users, try running the script with the -Tw switch.)

As with the front end to the searching CGI, a simple HTML form is all that is required to enable users to directly modify your database. Here again is our address book example:
<HTML><HEAD><TITLE>My Address Book</title></head>
<BODY>
<h4>Fill out the form below to add an entry to the address book</h4>
<FORM ACTION="add.cgi" METHOD="POST">
Name: <INPUT SIZE=45 NAME="name"><br>
Phone: <INPUT SIZE=45 NAME="phone"><br>
Street: <INPUT SIZE=45 NAME="street"><br>
City: <INPUT SIZE=20 NAME="city"> State: <INPUT SIZE=3 NAME="state">
Zip: <INPUT SIZE=6 NAME="zip">
<br><br>
<INPUT TYPE=SUBMIT VALUE="  Add Entry  ">
<INPUT TYPE=RESET VALUE="  Reset Form  ">
</form></body></html>
This form is almost identical to the one we made for searching. The difference comes in how the data is treated by the CGI program. In this case, the CGI script for adding an entry is actually much simpler than the searching script. In this case, we will assume that the database is a world writeable file.
#!/bin/perl

require cgi_head; # Set up the CGI environment

while (-e "datalock") { sleep 1; } # If database is currently being
                   # modified, wait.
system("touch datalock");        # Lock database

open (DAT, "database.txt"); # open the database for reading
while (<DAT>) { $line = $_; } # Read the last line of the database
close DAT;
if ($line =~ /:/) {
        @field = split (/:/, $line);
        $num = $field[0]; # Get last ID number
        $num++;
} else { $num = 0; }      # Create new ID number

open (DAT, ">>database.txt"); # open the database for appending

# Add entry to database
print DAT
"$num:$FORM{'name'}:$FORM{'phone'}:$FORM{'street'}:$FORM{'city'}:$FORM{'state'}:
Â$FORM{'zip'}\n";

close DAT;
system ("rm datalock");
print <<EOF;
<HTML><HEAD><TITLE>Addition Successful</title></head>
<BODY>
<h4>Your entry has been added to the address book</h4>
<A HREF="add.html">Go</a> back to the form to add another user.
</body></html>
EOF
In effect, this CGI script simply appends the new entry to the database. The first snag becomes file locking. If someone else is modifying the database at exactly the same time, one of the changes will be lost or the entire database will become corrupted. To circumvent this, we use a lock file to tell if someone else is writing to the database. This is far from the most elegant solution, and most systems provide a flock() function to more effectively lock the file from simultaneous access. Secondly, the ID number of the entry must be determined. In this case, we can assume that the entries will be sequentially numbered and that the last entry will have the last ID number. So we simply read the last line of the database, grab the ID number from that, and then increment it to obtain the new ID number.
Now that anyone can add entries to the address book, it may become necessary to delete or modify entries. To do that, however, there must be some way for the user to indicate the desired entry to modify or delete. Instead of creating a whole new form for this, we can add this functionality to our existing search CGI. If the user's search returns exactly one result, a line can be added to the HTML result page offering the option to modify or delete this entry. (This could be done for more than one result fairly easily, but we will stick with one for brevity's sake.) This can be done by changing the following lines at the bottom of the search CGI:
print <<EOH;
</pre><br>
Thank you for using my address book.
<A HREF="addrbk.html">Go</a> back to the form to make another search.
</body></html>
EOH
to:
print "</pre><br>\nThank you for using my address book.\n";
print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another search.<br>\n";
if ($#results == 0) {
print "<A HREF=\"change.cgi?a=d&n=$result[0]\">Delete</a> this entry.<br>\n";
print "<A HREF=\"change.cgi?a=c&n=$result[0]\">Modify</a> this entry.<br>\n";
}
print "</body></html>\n";
The added lines print links to a new CGI program, passing two values: a parameter indicating whether a deletion or a modification is wanted, and the ID number of the entry to delete or modify.
Because our database is delimited, we will have to regenerate the entire database to make a change, as shown in Listing 10.2.

Listing 10.2. Outputting an ASCII database.
#!/bin/perl

require cgi_head; # Set up CGI environment

while ( -e "datalock" ) { sleep 1; } # Wait while someone else is
                     # modifying database.
system ("touch datalock"); # Lock the database.

# Load database
open (DAT, "database.txt") || die "Can't open the database: $! !.\n";
$maxn = 0; # A counter for the number of entries.
while (<DAT>) {
    chop;
    @field = split(/:/); # Split the line into the data fields.
    $n = $field[0]; # First field is an id number
    $add[$n]{'name'} = $field[1]; # Then the name
    $add[$n]{'phone'} = $field[2]; # The phone number
    $add[$n]{'street'} = $field[3]; # The street address
    $add[$n]{'city'} = $field[4]; # The city
    $add[$n]{'state'} = $field[5]; # The state
    $add[$n]{'zip'} = $field[6]; # The Zip Code
    $add[$n]{'line'} = $_ . "\n"; # The entire line
}
$maxn = $n;

close DAT;

open (DAT, ">database.txt"); # Open database for writing.
if ($FORM{'a'} eq "d") {               # If a deletion is being requested,
    for ($I = 0; $I <= $maxn; $I++) {   #print all entries except the
        unless ($I == $FORM{'n'}) { # one to be deleted.
            print DAT $add[$I]{'line'};
        }
    }
    # Print a message then exit.
    print <<EOP;
<HTML><HEAD><TITLE>Request successful</title></head>
<BODY>
<H3>The selected entry has been deleted.</h3>
<A HREF="addrbk.html">Go</a> back to make another search.
</body></html>
EOP

    close DAT;
    system ("rm datalock");
    die;
} elsif ($FORM{'a'} eq "c") {

# If the user wants to modify the entry, things become a bit trickier.
# We must first print out a form, similar to the original form, to allow
# the user to change the values of the entry.
    $n = $FORM{'n'}; # Put the entry to be changed in an easier to type
            # variable.
    print <<EOF;
<HTML><HEAD><TITLE>Entry Modification</title></head>
<BODY>
<h4>Make the desired changes in the form below.</h4>
<FORM ACTION="change.cgi" METHOD="POST">
<INPUT TYPE=HIDDEN NAME="a" VALUE="m">
<INPUT TYPE=HIDDEN NAME="n" VALUE="$n">
Name: <INPUT SIZE=45 NAME="name" VALUE="$add[$n]{'name'}"><br>
Phone: <INPUT SIZE=45 NAME="phone" VALUE="$add[$n]{'phone'}"><br>
Street: <INPUT SIZE=45 NAME="street" VALUE="$add[$n]{'street'}"><br>
City: <INPUT SIZE=20 NAME="city" VALUE="$add[$n]{'city'}">
State: <INPUT SIZE=3 NAME="state" VALUE="$add[$n]{'state'}">
Zip: <INPUT SIZE=6 NAME="zip" VALUE="$add[$n]{'zip'}">
<br><br>
<INPUT TYPE=SUBMIT VALUE="  Modify Entry  ">
<INPUT TYPE=RESET VALUE="  Reset Form  ">
</form></body></html>
EOF

    # This form adds two hidden fields, telling this CGI which entry to
    # modify.
    for ($I = 0; $I <= $maxn; $I++) { print DAT $add[$I]{'line'}; }
    close DAT;
    system ("rm datalock");
    die;

} elsif ($FORM{'a'} = "m") {
# Make the change on the modified entry.
    $n = $FORM{'n'}; # Copy the entry to be changed into a more
             # typeable variable.
    # Assign the modified values to the entry.
    $add[$n]{'name'} = $FORM{'name'};
    $add[$n]{'phone'} = $FORM{'phone'};
    $add[$n]{'street'} = $FORM{'street'};
    $add[$n]{'city'} = $FORM{'city'};
    $add[$n]{'state'} = $FORM{'state'};
    $add[$n]{'zip'} = $FORM{'zip'};
    $add[$n]{'line'} =
"$n:$add[$n]{'name'}:$add[$n]{'phone'}:$add[$n]{'street'}:$add[$n]{'city'}:$add[$n]{'state'}:
Â$add[$n]{'zip'}\n";
    for ($I = 0; $I <= $maxn; $i++) { print DAT $add[$i]{'line'}; }
    close DAT;

    print <<EOE;
<HTML><HEAD><TITLE>Modification successful</title></head>
<BODY>
<H4>The requested entry has been modified.</H4>
<A HREF="addrbk.html">Go</a> back to the form to make another search.
</body><//html>
EOE
    system ("rm datalock");
    die;
} else { die; } # This should never be reached.

Now we have a complete address book system in which entries can be added, deleted, modified, and searched in any number of categories. As it is, though, this address book is lacking in several important areas:
  • Clean Code-For readability's sake, the code in this example is rather naive. There are several sections that could be optimized, and the entire thing could benefit from a more object-oriented approach. Object orientation is one of the most important things to strive for when developing a CGI program (or any program designed for use on the Internet). Unless your code must be proprietary, there is a good chance that it will travel far and wide across the Web and will be used by many different people in many different ways. A modular approach allows people to reuse portions of the code that are useful to them and makes understanding the structure of the program much easier.
  • User Interface-The HTML pages generated by these CGI scripts are Spartan to the extreme. A well-thought-out design can make your CGI more pleasant to use, encouraging people to come back. It is important, however, not to take it too far. If your database is for business or academic purposes, the Web pages should reflect that. Also make sure (especially if you are in an academic environment) that your scripts are usable by as many people as possible. Many people are still connecting to the Internet through modems 14.4 Kbps and slower. Large graphics, Java applets, or other needless bells and whistles can make your page worthless to a large potential audience. And if at all possible, make the page usable from a line-mode browser such as Lynx. Such browsers are the last refuge for people with very slow connections, and are the only means of access for the visually impaired and other physically challenged people.
  • Functionality-While it's important not to make your CGI programs too complicated to use, it helps to anticipate the needs of your users and to include as much functionality as possible. In our address book, for example, a better search engine would greatly improve its usefulness. As it stands, the script returns only entries that match exactly with database entries. If the scripts could do substring matching, a person could search for all entries with a common last name or all people who have the same telephone prefix. With Perl's pattern matching and regular expression capabilities, this would not be a difficult addition.

Web Indexing

Somewhere between small text databases and large database servers are databases that contain information about the Web itself. Such databases provide for users the capability to search for information on a site without having to look at every page by hand. The most common modus operandi of a Web index is as follows: A user enters one or more keywords into an HTML form; the search engine gathers the URLs of pages that match the keywords; the results are returned to the user weighted by some sort of scoring mechanism.

Indexers

The first step in putting a searchable index of information on the Web is generating that index. A number of freely available packages exist on the Internet to do just that, including Wais, Swish, Ice, and Glimpse.

Wais

Probably the most common Web indexer in use today (which predates the existence of the Web) is Wais (or freeWais or freeWais-sf). Wais was originally developed by Wais, Inc. (now owned by America OnLine). The most recent development on Wais has branched off into a freely redistributable version called freeWais and an enhanced version called freeWais-sf. Information about Wais in general (and freeWais-sf in particular) is available at
http://l6-www.informatik.uni-dortmund.de/freeWAIS-sf/
Source code is available from
ftp://ftp.germany.eu.net/pub/infosystems/wais/Unido-LS6/
Wais was designed as an all-purpose text indexer but is very useful at indexing HTML and other Web-related documents.
Installing freeWAIS-sf creates several programs, including waisserver, waissearch, waisq, and waisindex. Waisserver is a daemon that accepts requests from any machine on the Internet, processes queries, and returns information on the requested documents with weighted scoring information. Waissearch is a client used to connect to waisservers across the Internet. Waisq is a client for use on a local server. Waisindex is the actual index program. It takes a list of files and generates a database containing all the words on the files, sorted and weighted by a number of criteria. At the current time, indexes generated by waisindex are about twice the size of the original documents.

Swish

Swish was developed by Kevin Hughes of EIT. It is available at
http://www.eit.com/goodies/software/swish/
Swish was designed from the ground up as an HTML indexer. It is not (nor does it claim to be) as complex or full-featured as Wais, but it is much smaller, simpler to install, and easier to maintain. Both the indexer and the search engine are in the same program. Also, because it was designed for the Web, Swish is able to take into account HTML tags, ignoring the tags themselves and giving higher precedence to text within certain tags (like headers). One of the most noticeable drawbacks of Swish is that it does all of the indexing in RAM. So the total of all the files you wish to index cannot exceed your RAM (Wais offers a maximum RAM switch with its indexer). However, unless you have a very large site (say, over 30 MB of files on a 32 MB machine), this should not be a problem.

Ice

Ice is a Web indexing program written entirely in Perl. It uses a very simple indexing format that becomes slow with large numbers of documents but is very fast and efficient for sites with up to a couple of thousand files. Ice was created and is maintained by Christian Neuss and is available at
http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html
Ice also supports a thesaurus file, which allows for synonyms and abbreviations while searching.

Glimpse

Glimpse is a fairly new entry in the indexer wars, having just now gained widespread attention as the default search engine of the Harvest system. Glimpse is similar to Wais in that it builds as several executables and offers many options when searching. Glimpse also appears to be highly intuitive, with most of its advanced searching options accessible with a simple command-line switch. It is being developed at the University of Arizona and is available at
http://glimpse.cs.arizona.edu:1994/

Search Engines

Once the index of files exists on your server, the next step is providing a way for users to access this from the Web. This is where CGI comes in. A CGI program must take a set of keywords (or some other sort of query) from a form, pass it to the search engine, and then interpret the results. Because all the work is done by the indexer/search engine, this front end can be fairly simple. Not coincidentally, there are dozens of them available on the Net, and it is not a major task to customize one for your own use.

Wais Front Ends

Due to the popularity of Wais, interfaces between it and the Web are very common. A Perl interface (WAIS.pm) is standard with certain releases of freeWAIS-sf. Another Perl front end (wais.pl) comes with ncSA httpd. A list of other interfaces between Wais and the Web can be found on Yahoo at
http://www.yahoo.com/Computers_and_Internet/Internet/Searching_the_Net/WAIS/

Other Front Ends

Several front ends exist for the other search engines, as well. Ice comes with its one CGI program (ice_form.pl). WWWWAIS is a program by the maker of Swish that serves as a front end to both Wais and Swish indexes. It is available at
http://www.eit.com/goodies/software/wwwwais/
Harvest is an ambitious set of tools developed by Colorado University, Boulder, which hopes to provide a central package to "gather, extract, search, cache, and replicate" information across the Internet. Harvest uses Glimpse as its default search engine. It is available at
http://harvest.cs.colorado.edu/harvest/

Rolling Your Own

With a little thought and effort, it is not hard to create your own custom front end for an existing search engine. A few things must be considered:
  • Getting the information from the form-This is the easy part. As mentioned earlier in the chapter, there are packages for all CGI languages that serve to retrieve information from a form and store it in variables of some kind. In this case, the information will be a list of keywords and perhaps some constraints such as a Boolean AND or OR search, a maximum number of results to return, or a specific index to search.
  • Parsing the information-Before passing it to the search engine, the data must be put in the right form (usually as command line arguments). Also at this stage, simple error detecting can be performed. Checks should be made that the user entered all necessary data in the form.

Caution
At this point, the program should also check to make sure that the user is not trying to pull a fast one. In the next step, an external program is called, so care must be taken to prevent the infamous keyword; rm -rf / trick. Almost universally, a semicolon is a command separator, and so a wannabe attacker could insert one into his or her query, followed by his or her malicious command(s).
Don't fall into this trap.

  • Calling the search engine-Now that all of the information has been verified as safe and is in the correct format, it must be passed to the search engine. Using UNIX-derived languages (such as C/C++ and Perl), this is most effectively accomplished by using process pipes. Consider the following snippet of Perl code that takes the prepared information, passes it to the Wais search engine, and then reads the output:


pipe(P0R,P0W); # Creates one read/write pipe
pipe(P1R,P1W); # Creates another read/write pipe

if ($pid = fork) { # This created a new process,
    # This is the parent process
    close(P0R); # Close the read end of the first pipe
    close(P1W); # and the write end of the other one
    &read_from_wais(P1R); # This calls a subroutine which is fed input
    # into P1R. It then interprets it into search results.
} elsif (defines $pid) {
    # This is the child
    close(P0W); # Close the write end of the first pipe
    close(P1R); # Close the read end of the second pipe
    open(STDIN, "<&P0R"); # Duplicate P0R as the standard input
    open(STDOUT, ">&P1W"); # Duplicate P1W as the standard out
    # Now the standard output will travel through P1W into P1R which
    # is being held by the parent who sends it off to the subroutine.
    exec(@argline) || die; # @argline holds the command to execute the
                   # Wais search engine
    # At this point the child dies
    } else { die("Can't fork!"); } # This is only reached if fork()
                       # fails
    # Parent now continues with any information retrieved from the
    # search engine.
Manipulating pipes and forks can be tricky at first, but it greatly increases the power of interprocess communication, which is necessary to interact with an external search engine.

Large Scale Databases

At some point, you may encounter a project that is simply too big for a text-based database and is not suited for a text-indexing system. Fear not; others have been down this road and fortunately have left a lot of software behind to help integrate large database servers with the Web. A "large scale" database need not be large, per se. It is simply any database that is not a flat ASCII file. Popular commercial databases apply, such as dBASE, Paradox, and Access (although they are all able to read ASCII files, it is just not their preferred method of storing information). Also fitting this category are database servers such as Sybase, Oracle, and mSQL.
When dealing with a large-scale database, the trick is not in storing or manipulating the data as it is with the text database. The database server does all that work for you. The trick is communicating with the database server. There are almost as many database communication protocols as there are databases, despite the existence of some very complete standards (such as SQL). Programs exist for practically every database that has communications capabilities to interface with the Web. A list of some programs follows:

Tip
Much of the information that follows can be found online (in, no doubt, an updated form) at Jeff Rowe's excellent page
http://cscsun1.larc.nasa.gov/~beowulf/db/all_products.html

  • 4D
    NetLink/4D (http://www.fsti.com/productinfo/netlink.html)-This is a commercial product for Macintosh computers running the WebStar server. It allows users to directly manipulate 4D databases.
  • Microsoft Access
    4W Publisher (http://www.4w.com/4wpublisher/)-This is a commercial product that generates static HTML pages from an Access database. A CGI version is due out soon that will allow dynamic access to the database.
    A-XOrion (http://www.clark.net/infouser/endidc.htm)-This is a custom commercial database server for the Windows platform. It allows real-time access to major brand pc databases (Paradox, dBASE, FoxPro Access), but it requires Access to run.
    dbWeb (http://www.axone.ch/dbWeb/)-This is a freeware tool to maintain large hypertexts using an SQL interface to Access. It has the capability to export large documents in a variety of formats including HTML pages, Microsoft Viewer, and tagged text.
  • DB2
    DB2WWW (http://www.software.ibm.com/data/db2/db2wfac2.html)-This is IBM's own proprietary (and commercial) interface to their DB2 database system. It uses SQL statements to interact with the Web server and generate HTML pages.
  • FileMaker Pro
    ROFM (http://rowen.astro.washington.edu/)-This is a freely available interface to FileMaker Pro databases. It is only for Macintosh computers.
  • FoxPro
    FoxWeb (http://www.foxweb.com/)-This is a commercial product for Windows NT that allows Web server integration with FoxPro databases.
  • GemStone
    GemStoneWWW (http://ftp.tuwien.ac.at/~go/Implementation.html)-This is a mini-HTTP server that handles requests that interface with a GemStone Smalltalk database system. It is freely available.
  • General
    Amazon (http://www.ieinc.com/webnews.htm)-This is an all-purpose, commercial "legacy system" integrator. That is, it takes data from Web forms and uses it to interact with systems that were designed before the Web, such as Oracle, Sybase, SQL Server, ODBC systems, and others.
    DBI (http://www.hermetica.com/techologia/DBI)-DBI is a database API (Application Programming Interface) for Perl. It is sort of a catch-all set of functions, variables, and conventions used to access any sort of database. There is currently support for Oracle, mSQL, Ingres, Informix, Sybase, Empress, C-ISAM, DB2, Quickbase, and Interbase databases, and more are in the works. Like all good Perl software, this is absolutely free.
    DBGate (http://fcim1.csdc.com/DBGate/dbintro.htm)-This is a commercial "serverless" database product. It interacts directly with the database files (Access and FoxPro are currently supported), thereby removing the need for ODBC or SQL communication with a database server.
    HyperStar WDK (http://www.vmark.com/Products/WDK/index.html)-This (commercial) package acts as a go-between from the Web browser to the server, interpreting database commands along the way. It supports a large number of databases including Oracle, Sybase, Informix, UniVerse, Ingres, Microfocus Cobol ISAM, and PI/open.
    mgyWeb (http://www.mgy.ca/mgyweb/)-This is a commercial solution for Windows 95 or NT that supports "many" databases and any Windows 95-compatible Web server.
    Sapphire/Web (http://www.bluestone.com/products/sapphire)-This is a commercial application builder that enables users to easily create C/C++ CGI programs that interface with several popular databases.
  • mSQL
    MsqlPerl (ftp://Bond.edu.au/pub/Minerva/msql/Contrib/MsqlPerl-1.03.tar.gz)-This is a Perl interface to Msql that mimics the Msql API for the C language. It allows on-the-fly updates of any mSQL database and is freely available.
  • ODBC
    Cold Fusion (http://www.allaire.com/cfusion/)-This is a commercial product for Windows 95 and NT that has become very popular in those circles. It allows for direct connectivity to any ODBC database. It contains scads of features and an easy-to-use interface.
    DataRamp (http://dataramp.com)-A commercial client/server/browser combo that allows secure read/write access to any ODBM database (including Access).
  • Oracle
    Decoux (http://www.abs.ee/~wow/htdocs/sdk10/decoux/)-This is a CGI interface to Oracle databases. It uses SQL statements imbedded in HTML tags that are parsed before sending the results to the client.
    Oracle WWWIK (http://www.abs.ww/~wow/htdocs/)-This is a collection of CGIs and other tools to connect Oracle databases to the Web. They are all freely available.
    ORAlink (http://oradb1.jinr/dubna.su/Software/ORALink)-This is a freeware product for Windows NT that uses CGI to access Oracle databases from the Web.
  • SQL
    dbCGI (http://www.progress.com/webtools/dbcgi/dbcgi.htm)-This is a freely available C program that allows SQL statements to be embedded into HTML files. A CGI program interprets the statements and takes appropriate action before displaying the HTML to the client.
    GSQL (http://www.santel.lu/SANTEL/SOFT/gsql_new.html)-This is a Mosaic only interface with SQL database servers. It is freely available.
    web.sql (http://www.newmedia.sybase.com/Offerings/Websql/web_spec.html)-This is a free product from Sybase that allows SQL statements to be imbedded into HTML files. The statements are then translated by a CGI program, and the results are passed to the client.
  • Sybase
    Genera (http://gdbdoc.gdb.org/letovsky/genera/genera.html)-This is a freely available package that integrates Sybase databases with the Web. It supports Web page generation from Sybase databases and HTML form queries.
    Sybase WWW Tools (http://www.sybase.com/WWW/)-A collection of CGIs and other tools for integrating Sybase databases for the Web. Freely available.
This is nowhere near a complete list of CGI resources for databases, and new products are being developed constantly. Consider this a jumping-off point to explore the possibilities. But suppose you've painstakingly checked out all of the products available, and you still cannot find a CGI program to meet your needs. All is not lost. By using common tools found on many Web sites, it may be possible to build your own CGI to interact with your database server of choice.
Consider the address book example from the beginning of the chapter. How could that be accomplished with a database server? First we need to decide on the right tools for the job. A good freely available database is mSQL (http://www.hughes.com.au/). It offers most of the functionality of a full-fledged SQL server, with low overhead and easy installation. The mail mSQL distribution comes with a C API and function library, but we'll stick with Perl just 'cause it's cool. There are several mSQL interfaces for Perl, some of which were mentioned in the preceding section. We'll use MsqlPerl for no particular reason (I've found that all of the Perl mSQL interfaces work fairly well).
The MsqlPerl module provides direct emulation of the C API functions. You can submit queries to the database which add, select, delete, and modify the database directly.

Note
There are two things to be aware of here. First, the program msqld must be running on whatever machine the database is stored on. This need not be the same machine as the Web server because the API provides functions to connect to a remote mSQL server.
Secondly, MsqlPerl is compiled as a dynamically loaded extension to Perl. Certain systems have trouble dynamically loading from Perl (notably AIX and non-ELF Linux). If your MsqlPerl script is failing, make sure that your copy of Perl is able to dynamically load properly.

We can use the same HTML form pages we used for the text database example. The first form searched an existing database. The appropriate CGI program is shown in Listing 10.3.

Listing 10.3. Searching an SQL database.
#!/bin/perl

require cgi_read; # Set up CGI environment
use Msql; # Load the Msql module (See note above for a caveat.)

$dbh = Connect Msql; # Connect to the local mSQL server

SelectDB $dbh "addresses"; # Selected the "addresses database. Assume that
               # it already exists. mSQL comes with a utility
                  # called 'msqladmin' which can create databases.

$all = "name, phone, street, city, state, zip"; # All address fields
$query = ""; # Set aside a variable to hold the query.

foreach (keys %FORM) {        # Gather all existing form requests into
    $query =. " $_ = $FORM{'$_'} AND";  # one line.
}
$query =~ s/AND$//;           # Get rid of that annoying trailing 'AND'

$sth = Query $dbh "SELECT $all FROM addresses WHERE $query"; # Send query

print "<HTML><HEAD><TITLE>Search Results</title></head><BODY>\n";

$I = 0; # 0 mean no results.
while (@arr = FetchRow $sth) {
    if ($I == 0) { $I = 1; } # 1 means 1 results
    if ($I == 1) { # Print success message
        print "<H4>Your search results are listed below</h4>";
        print "<PRE>\n";
    };
    $I++;
    # Print results one at a time.
    print <<EOF;
-- --

Name: $arr[0]
Phone: $arr[1]
Street: $arr[2]
City: $arr[3]
State: $arr[4]
Zip: $arr[5]

EOF
}
if ($I == 2) {
    print "</pre><br>";
    print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another
Âsearch.<br>\n";
    print "</body></html>\n";
} elsif ($I == 0) {
    print <<EOE;
<HTML><HEAD><TITLE>Search Failed!</title></head><BODY>
<h4>There are no entries which match your criteria</h4>
<A HREF="addrbk.html">Go</a> back to the form to make another search.
</body></html>
EOE
}

The actual CGI code in this case is about one quarter the size of the equivalent CGI in the first example. In addition, it is trivial to add substring matching to an mSQL query. Also note the distinct absence of ID numbers. mSQL is a relational, random access database. That is, it does not need to read in all entries to access any one of them.
The real power of a database server becomes apparent when modifying the database itself.

Note
Because the modification of the database is done by the server itself, the file permission issues of text databases do not apply here. Instead, most database servers have their own access rights schema that can be used to allow only select users to access certain databases.

Now we consider the CGI required to add entries to the database. Once again, we assume the a database called addresses exists that contains a table called addresses.
#!/bin/perl

require cgi_head;
use Msql;

$dbh = Connect Msql; # connect to the local mSQL server

SelectDB $dbh "addresses" # select the 'addresses' database

Query $dbh "INSERT INTO addresses ( name, phone, street, city, state, zip ) VALUES
Â('$FORM{'name'}, $FORM{'phone'}, $FORM{'street'}, $FORM{'city'}, $FORM{'zip'} )";

print <<EOF;
<HTML><HEAD><TITLE>Addition successful</title><head><BODY>
Your entry has been added to the address book.
<A HREF="add.html">Go</a> back to the form to add another entry.
</body></html>
EOF
Now I'll bet you're wondering why you ever used flat text databases to begin with! In a fraction of the code of the flat text version, a much more flexible addition scheme is produced. Of course, the biggest problem in using a database server is finding one. There are many situations in which you may have no control over which programs you have access to. In these cases, a text database may be your only recourse. Also, if a text database becomes corrupt, you load it into your favorite text editor and fix it. If your relational database becomes corrupt, unless you are a database expert you better have recent backups available.
Now into the final stretch, allowing for deleting and modifying entries from the database. Like the text database, a change must be made to the HTML form generated by the search program to allow the user to delete or modify the entry. Unlike the text database, we have no unique ID number to identify the entry. Therefore, we must pick another unique attribute to identify the entry. We could add an ID number into the database, but for simplicity's sake, we will use the name as the unique field. (And if you know two people with the exact same first, last, and middle names, you know too many people.) In databases, this sort of unique field is known as a "primary key."
So the lines near the end of the search CGI program
print "</pre><br>";
print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another search.<br>\n";
print "</body></html>\n";
become
print "</pre><br>";
print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another search.<br>\n";
if ($I == 1) {
print "<A HREF=\"change.cgi?a=d&name=$arr[0]\">Delete</a> this entry.<br>";
print "<A HREF=\"change.cgi?a=c&name=$arr[0]\">Modify</a> this entry.<br>";
}
print "</body></html>\n";
And the CGI itself simplifies considerably:
#!/bin/perl

require cgi_head;
use Msql;
$dbh = Connect Msql;
SelectDB $dbh "addresses";
$all = "name, phone, street, city, state, zip";

if ($FORM{'a'} eq "d") {
    Query $dbh "DELETE FROM addresses WHERE name=$FORM{'name'}";
    print <<EOF;
<HTML><HEAD><TITLE>Deletion successful</title></head><BODY>
<h3>Your entry has been deleted</h3>
<A HREF="addrbk.html">Go</a> back to the form to make another search.
</body></html>
EOF
    die;
} elsif ($FORM{'a'} eq "c") {
    $guy = Query $dbh "SELECT $all FROM addresses WHERE name=$FORM{'name'}";
    @guy = FetchRow $guy;
    print <<EOE;
<HTML><HEAD><TITLE>Modify Me</title></head><BODY>
<h4>Modify your entry in the form below</h4>
<FORM ACTION="change.cgi" METHOD="POST">
<INPUT TYPE=HIDDEN NAME="a" VALUE="m">
Name: <INPUT SIZE=45 NAME="name" VALUE="$guy[0]"><br>
Phone: <INPUT SIZE=45 NAME="phone" VALUE="$guy[1]"><br>
Street: <INPUT SIZE=45 NAME="street" VALUE="$guy[2]"><br>
City: <INPUT SIZE=20 NAME="city" VALUE="$guy[3]">
State: <INPUT SIZE=3 NAME="state" VALUE="$guy[4]">
Zip: <INPUT SIZE=6 NAME="zip" VALUE="$guy[5]">
<br><INPUT TYPE=SUBMIT VALUE="   Modify Entry   ">
<INPUT TYPE=RESET VALUE="   Reset Form   ">
</form></body></html>
EOE
    die;
} elsif ($FORM{'a'} eq "m") {
    foreach (keys %FORM) {        # Gather all existing form requests into
        $query =. " $_ = $FORM{'$_'},;  # one line.
    }
    $query =~ s/,$//;           # Get rid of that annoying trailing ','
    Query $dbh "UPDATE addresses SET $query WHERE name=$FORM{'name'}";
    print <<EOF;
<HTML><HEAD><TITLE>Modification successful</title></head><BODY>
<h3>Your entry has been modified</h3>
<A HREF="addrbk.html">Go</a> back to the form to make another search.
</body></html>
EOF
    die; # Th' th' that's all folks
}
It is as simple as that. Fifty lines of Perl as opposed to over one hundred. The sheer flexibility of a database server opens the door to worlds of possibilities.

Summary

The world of database interaction with the Web is as wide as it is deep. For any problem, there are dozens of solutions-some better than others. It all boils down to personal preference and experience. For a person with a background in Windows data processing, it would probably be easiest to use a Windows database such as Paradox or Access along with one of the ready-made CGI interfaces for them. A seasoned C programmer may prefer playing with one of the APIs provided for the various SQL servers such as Sybase, Oracle, SQL Server, and mSQL. Someone who is "just another Perl hacker" would probably feel most comfortable with one of the many Perl interfaces to SQL and ODBM database server (or perhaps they would just write one of their own).
Within this realm of great flexibility, certain things must be considered to make an effective Web-Database interaction:
  • What tools are available?  Are you the systems administrator of a high-powered UNIX server? Or perhaps your only Internet connection is a PPP linked Windows machine. It is possible to get almost any job done with any tool, but it helps greatly to know your capabilities before you begin. If you are using a Windows platform, it may not be worth it to fight with badly implemented versions of programming languages when there is a huge library of tools already available, albeit for a cost. If UNIX is your stomping grounds, you may want to think twice before shelling out thousands of dollars for a Sybase or Oracle server when free tools such as mSQL and POSTGRES are easily available. (But then again, perhaps your needs are so complex that only a high-end database server will suffice.) It pays to be aware of what you have available to you.
  • What is the size of your project?  Size does matter. As we have seen in this chapter, flat text based databases simply don't scale well with size. A 50 MB version of our phone book could bring even the most powerful RISC server to its knees. At the same time, it takes time to construct a database using a server. You need to have permission, and above all, you need to have access to a server. If you are a student with a university granted account, there's a good chance that text is all you have to deal with. Even between database servers, the size and complexity of your project makes a difference. Keeping a database of every phone number in America might be tough job for a personal database like Access or Paradox. On the other hand, a database with tables linked to tables in weird and convoluted ways is simply not possible in free servers like mSQL. (POSTGRES does has more functionality, at the price of speed, complexity, and lack of standard SQL interface.) Before you begin your project, try to anticipate its maximum size and plan accordingly.
  • To program or not to program (which language is the question)?  With the rather considerable list of resources shown earlier in the chapter, why would anyone need to write their own CGI at all? Perhaps you don't. If you have the money to spend, there is a good chance that something is out there that will suit your needs. Even if you aren't rich, you would probably find something close. But it is not always close enough. Maybe one little tweak would add that last little feature you want. Or perhaps the existing software does too much, and you would prefer the added speed gained by a few less bells and whistles. A little knowledge of how CGIs communicate with databases can go a long way.
  • Is there an alternative to CGI?  The Web is a big place, growing every day, and CGI is no longer the only game in town. Don't fall into the trap of ignoring new technologies just because they are new. (At the same time, don't be blind to their faults. Every new language or protocol must go through a testing period. During that time, applications using that language or protocol have great potential to cause damage, either directly or by opening security holes.) Several companies have developed original Web servers that are specifically designed to interface with databases. For example, NeXT Inc.'s WebObjects (http://www.next.com) is an innovative object-oriented Web server that directly interacts with a powerful back-end database. In addition to these there is Java, which has gained an enormous amount of publicity in its first year on the Internet. Java has the capability to open persistent connections between the server and client, providing a means for updating the user's view of the database continually. For example, using Java with a database of stock prices could provide a running stock ticker on the user's screen. Beware, however, Java is the archtypal "new technology" with all of the good and bad that comes with it.
Using CGI to integrate databases with the Web follows naturally from the capabilities of CGI. Bypassing a great deal of CGI's shortcomings, database interaction is perhaps one of the only areas in which CGI will remain the best tool for the job in the face of new technologies.






Security:Chapter 9


Chapter 9

Security





CONTENTS


Computers on a public network like the Internet can be vulnerable to misuse by malicious network users, or crackers. CGI is another Internet mechanism that can and must be deployed with security issues in mind.

Is CGI Insecure?

At the time of the writing of this chapter, a significant proportion of the announcements on the Internet security mailing lists and bulletin boards describe CGI-related security problems. Subscribers could easily get the impression that CGI represents a security risk to any organization that employs it. Is this fear justified?

CGI Is a Power Tool-Use with Care

The Common Gateway Interface specification is not insecure per se. The specification defines a way for World Wide Web servers to interact with query engines and information gateways. It entails the use of environment variables and standard input and output streams, none of which are fundamentally vulnerable. It is not the interface that is insecure.
However, CGI represents a powerful feature of many Web browsers. This feature allows a Web server not only to provide information, but also to provide access to the computing power of the server. It is important to note that a Web server that supports CGI gateway engines also gives Web browser users a degree of control over what the Web server does.
Careful use of CGI can deliver interactive Web sites, user-friendly information retrieval, and access to information not designed for the World Wide Web. This is achieved by allowing the Web browser user to control the information delivery and by implementing automatic translation of data from one form to another.
Careless use of CGI can and will compromise the security of the information provider. A CGI application implemented without due regard to security issues will allow the Web browser user much more control over the Web server than the programmer intended. If an organization is complacent about the security of its World Wide Web server, it should expect abuse of its computing facilities, downtime due to malicious attacks, and loss of information integrity of confidentiality.

CGI-Related Security Vulnerabilities-An Example

Security vulnerabilities result from programming or implementation that does not guard against accidental or deliberate misuse. An example of this is a typical CGI gateway for accepting data typed into a World Wide Web form and passing it on as an e-mail message, as shown in Listing 9.1.

Listing 9.1. An insecure HTML form handler.
#!/usr/local/bin/perl
# formmail.cgi
# Accepts form submission and resends as an e-mail message to "webweaver"

# Call library routine to translate and split form submission
# into perl variables $input{"field"}
require "cgi.pl";

# Launch e-mail application "/bin/mail" with Subject: header from the "formname" Âfield
open (MAIL, "|/bin/mail -s ' ".$input{"formname"}." ' webweaver");

# And send "formcontents" field as the body of the message
print MAIL $input{"formcontents"};

close(MAIL);
exit(0);

This CGI gateway program will do what the programmer intended for most form submissions. It sends the form contents to the e-mail address "webweaver" using the auxiliary program /bin/mail.
However, this apparently simple and benign gateway could be a security loophole because it does no checking on the user supplied form data before passing it to the mail program. Notice what happens to the user-submitted data. A library module, cgi.pl, unpacks the form submission, restoring any characters that have been rewritten for safety by the Web browser or Web server, and then uses that information in a command interpreter to launch another program, /bin/mail. The data from the form field "formname" is passed unchecked to a command interpreter as an argument to the /bin/mail command. Then the data from the form field "formcomments" is passed unchecked as input to the auxiliary mail program.
The security vulnerability arises because command interpreters and several other applications assign special meanings to certain characters in their input. If the Web form user maliciously or even innocently included such special characters in either form field, the form submission could have side effects that the programmer did not anticipate. A malicious Web user could include operating system commands in either form field, and by surrounding them with appropriate special characters, have them run on the Web server. These commands could damage data integrity or allow the Web user unauthorized access to data on the Web server. They might even be used to give the user full control of the Web server.
For example, the cracker might construct a form submission in which the "formname" field is set to
'`grep root /etc/passwd` cracker@illegal.org #'
This CGI gateway program is a security hole waiting to be hacked, simply because the programmer failed to check the user-supplied data before passing it on to other programs.

General Internet Security Issues

Other security vulnerabilities can arise from assuming that the "conversation" between the Web browser and the Web server is private. Inviting the user to enter secret passwords, credit card numbers, and other confidential information puts the confidentiality of that information at risk. The Internet is a public network. World Wide Web form submissions are usually unencrypted. It is possible that the information in the form submission could be captured and read somewhere between the browser and server.
The mechanisms that make up the Internet are themselves less secure than some Internet users realize. For instance, there is no easy way to prove that an Internet electronic mail message is genuine. Forging mail messages that appear to come from one person but actually come from someone else is trivial, especially now that more and more Internet users install and configure their own e-mail applications. Internet data streams purporting to come from one source can be spoofed or hijacked by skilled crackers. These are not insecurities in CGI but should be taken into account in any assessment of the security of CGI.

Alternatives to CGI

Some Web servers use proprietary interfaces as an alternative to CGI. While these may be claimed to be more secure than the CGI system, they often limit what the gateway programmer can do or allow just as much misuse.
Some Web browsers support secure network communications, interactive and programmable features such as built-in browser control scripting languages, or even full network application systems. These can be used to add a level of security to CGI, or even to achieve similar results but with browser-side processing instead of server-side handling. However, an information provider exploiting these features will bar access to the information for users with other browsers.

Can CGI Be Used Securely?

Despite the dangers described previously, if the Web information service implementors design their implementation to guard against potential misuse, a CGI gateway can be profitable and useful without introducing security vulnerabilities.

Security Is the Responsibility of Both Programmer and Administrator

The implementors of a Web service are jointly responsible for CGI security, but the defensive weapons in their armory differ according to their role.

What Can a Web Server Administrator Do to Improve Security?

The system administrator of a Web server can do much to defend against CGI misuse, as detailed in the following sections.

Work with Programmers

The administrator should discuss with programmers all CGI-based implementations and security risks. Together they can share information on known security problems with server software and establish codes of practice that reduce the risk from attacks. They could also implement a process of peer review through which programmers review each other's code for possible security vulnerabilities.

Use Well-Respected Server Software

Security concerns should influence the choice of Web server software. Both the HTTP server, and any other off-the-shelf server software such as CGI libraries and gateways should be selected with care. Read the release notes for the software and regularly check the Web "home page" for the server software for information about security problems and new versions. Where possible, use the most recent stable version of the software. Don't be tempted to implement "beta-release" software by the promise of new features. Security vulnerabilities are often found in "beta" software-vulnerabilities that are fixed in the production release. Subscribe to any Web-related mailing lists and security bulletin boards where server security problems are discussed.

Restrict Server Access to Trusted Network Hosts

If the intended audience for your Web site uses a specific set of machines, perhaps within one organization, it may be possible to restrict access to your Web server to allow connections from only those machines. This can be achieved through the scoping features of certain Web servers, using "access.conf" or ".htaccess" files, for example. The same or better protection can also be provided at the network level by using TCP/IP wrapper software or router access control lists.

Restrict Access to CGI Functionality

On Web servers that do not make use of CGI gateways, the administrator should disable the CGI functionality altogether. If CGI gateways are needed, the administrator can often restrict the CGI functionality to a specific part of the Web site or deny CGI functionality to all but the trusted users. This may mean that CGI access is allowed from browsers within your organization to the CGI development area, but World Wide Web users on the Internet as a whole are allowed access only to tested and trusted CGI gateways.

Examine CGI Code, Especially Freely Available CGI Packages

Administrators should take the time to carefully read the source code and release notes for CGI gateways before they are installed. This advice applies not only to CGI programs developed in-house, but also to freely available CGI code. Security vulnerabilities are often discovered in public domain server software after it has been released. An administrator should also follow the Internet mailing lists and Usenet newsgroups that discuss the software concerned for news of possible security problems.

Run CGI Programs in a Protected Environment

If the operating system or Web software on your server will allow, ensure that CGI programs are run in a protected environment. On multiuser operating systems, set up your server to run as a nonprivileged user, preferably a user specifically for that purpose. Under networked operating systems, there may already be a nonprivileged account known as "nobody," but where possible use a different account specifically for running CGI programs.
If possible, run the CGI program in a virtual emulated machine, or in a subsection of the server's file system so that it cannot see the rest of the server files.

Caution
Do not run the Web server with super-user privileges. If a malicious intruder finds a security vulnerability in the Web server software, that intruder will immediately have full control of the server. Avoid CGI wrapper software that gives CGI programs the same privileges as the author of the CGI script. Such a wrapper merely lends extra power to the intruder who exploits any CGI security vulnerabilities. If a CGI gateway cannot access information as an untrusted user, this should prompt the implementor to reassess the availability of the information, not the privileges of the CGI gateway.

Run CGI on a "Sacrificial" Machine, Outside any Firewall

Choose a machine to be the CGI server that does not hold any secure information and that is not generally trusted by other network hosts. This need not even be your main Web server; setting aside a machine exclusively as a CGI server simplifies the security problem. If your organization uses a firewall gateway or router, position your CGI server outside this firewall to limit the advantages of an intruder who succeeds in exploiting a CGI security hole. Do not host the CGI scripts on the firewall gateway itself, because a security infiltration could compromise the whole organization.

Run CGI with Low Scheduling Priority

If possible, set the priority of CGI programs lower than other processes in a multiprocessing environment. This will limit the damage caused by malicious or accidental floods of CGI requests that might otherwise have disabled the CGI server.

Regularly Read Security Mailing Lists and Usenet Newsgroups

Security vulnerabilities and improved versions of server software are often announced and discussed in the Internet discussion groups. Information from software suppliers and other users can be invaluable.

What Can a CGI Programmer Do to Improve Security?

The author of a CGI gateway program can also do much to defend against security breaches, as outlined in the following sections.

Work with the Server Administrators

The programmer should discuss with Web server system administrators all CGI-based implementations and security risks. Together they can establish codes of practice that reduce the risk from attacks. They could also implement a process of peer review through which programmers review each other's code for possible security vulnerabilities.

Use Well-Respected Library Software

When choosing CGI toolkits and library software, examine and test them for possible security vulnerabilities. Read the release notes for the software and regularly check the Web "home page" for the library software for information about security problems and new versions. Where possible, use the most recent stable version of the software. Don't be tempted to implement "beta-release" software by the promise of new features. Security vulnerabilities are often found in "beta" software-vulnerabilities that are fixed in the production release. Subscribe to any Web-related related mailing lists and security bulletin boards where CGI security problems are discussed.

Restrict Access by Client Hostname

If the intended audience for your application uses a specific set of machines, perhaps within one organization, it may be possible to restrict access to your CGI gateway to allow connections from only those machines. This can be achieved either through the scoping features of certain Web servers, or by checking the REMOTE_HOST environment variable.

Restrict Access by Using HTTP Passwords

If you have a small number of known users for your application and the Web server you are working with has support for HTTP password authentication, you might choose to implement a username and password scheme to restrict access to the CGI gateway to the trusted set of users. This is not a substitute for careful coding, but it allows the programmer to put less emphasis on defending against malicious attacks or unauthorized use.

Be Paranoid

If you are writing software for any public network service, it is safest to believe that they are out to get you. Even if you consider the data you are handling to be public and your organization to be unattractive to crackers, remember that there are groups of people on the Internet who derive all their self-actualization from finding the security holes in your software, gaining unauthorized access to your computers and disrupting your network service, wasting the time and effort of you and your colleagues. Program defensively.

Make No Assumptions

It is dangerous to make assumptions about the data that will be presented to a CGI program by the Web server.
Beware of assuming that the data is a submission from your form. Anyone can point a Web form at your CGI gateway, or generate an HTTP request that looks like a form submission but contains unsafe data.
The example of an attempt to use Listing 9.1 (formmail.cgi) to crack system security would probably be made from a raw, interactive HTTP connection opened by the cracker. It supplies an unexpected value to a form field that the form designer probably intended to be hidden from the Web user.
Beware of assuming that the data submitted is small enough to fit where you want it to fit. Whatever limitations you include in a Web form, a faulty Web browser or a wily cracker will easily get around them and attempt to crash or abuse your system by sending more data than you expected.
Beware of assuming that special characters in the data have been encapsulated by the browser using the %hh hexadecimal escape sequence. Browsers may not implement this convention, and crackers may easily circumvent it.

Choose What Input to Accept, Not What to Reject

Many discussions of CGI security attempt to address the problem of characters in the query or submission that have special meanings.
Command interpreters and other simple interpreted languages are the most common victims. Characters like backquote ("`"), backslash ("\"), and dollar ("$") are interpreted as part of the interpreted language and can be exploited to trick the CGI gateway into running commands for a cracker on the Web server.
Other tools and even the operating system itself can be abused. Some useful applications will execute arbitrary commands if given the wrong input. ASCII control characters (those with decimal codes less than 32) can be used to disrupt text files where user supplied form or query data is logged.
Unfortunately, the most common defense is to try to compile a list of special characters and to guard against or exclude only those characters. This piecemeal approach is risky at best. The lists, like politician's speeches, are more interesting for what they omit than for what they include and are typically stripped from queries or form submissions as they have special meanings to command interpreters. Recently, several Web servers had to be rewritten to also defend against the inclusion of the "end-of-line" characters in search queries as these are considered special by many operating system operations.
A more satisfactory defense is to reduce the submitted input to a small set of acceptable characters. This set of characters will vary from application to application. For instance, a person's name could be restricted to upper- and lowercase letters (including the accented letters in the upper-half of the ISO-Latin-1 character set), spaces, hyphens, and apostrophes. With this analysis, the programmer will immediately discover that it is not possible to pass a person's name to a command interpreter wrapped in single quote characters because the single quote character (or apostrophe) can reasonably form part of someone's name.
The key technique is to choose a set of characters to accept, not to choose a set of characters to reject. The choice of acceptable input characters will be influenced by the intended use. If the input is to be passed as part of a command to the operating system command interpreter (as in Listing 9.1), programmers must find out whether any of the characters they would like their CGI program to accept have a special meaning to the operating system or to the other command.

Program Defensively

Choose criteria by which you can validate the query or form submission. For instance, if the user has been asked to supply an Internet e-mail address, reject a submission that does not conform to the relevant Internet standards. You may even choose to validate the supplied e-mail address by sending a secret password to the address and insisting on the password for future submissions. Be prepared to have your program handle garbage input, empty input, random input, prank submissions, and malicious attacks.
Choose limitations on the size and structure of acceptable input. It is easy to assume that a prompt for a name will yield a response small enough to fit into the available memory of the Web server, but there is no reason why it should. A malicious attacker could send several megabytes of binary data where you expected a personal name. Careless handling of a "denial of service" attack like this could lead to Web server downtime or even software damage. If you requested a single line of text, reject submissions containing end-of-line characters. If a Web form includes selection lists or checkboxes, reject any data submitted that is not formed from the options presented to the user.

Never Allow User Data to Be Reinterpreted

The data supplied by a user in a form submission or query should be treated as "contaminated" until it has been cleaned of potentially dangerous special characters.
The example program in Listing 9.1 passed data submitted from a form unchecked to the operating system command interpreter and to an e-mail application. A cracker suspecting this could have easily included quote characters in the form submission that could direct the command interpreter to run any command the cracker chose. The cracker might equally have chosen to exploit the e-mail application that might be similarly persuaded to run commands with the use of an escape character or exclamation mark.
The program might have been more safely written in the manner shown in Listing 9.2.

Listing 9.2. A more secure HTML form handler.
#!/usr/local/bin/perl
# formfile.cgi
# Accepts form submission and logs to a file for later use

# Call library routine to translate and split form submission
# into perl variables $input{"field"}
# the library routine limits the size and content of the input
# to a length and to characters considered safe
require "safecgi.pl";

# Open the log file for "append". Do not pass the form contents to any operating Âsystem routine
open (FILE, ">>/home/webweaver/form.log");

# Write some key headers for this message

print FILE "Script:".$ENV{"SCRIPT_NAME"}."\n";
print FILE "Host: ".$ENV{"REMOTE_HOST"}."(".$ENV{"REMOTE_ADDR"}.")\n";
print FILE "Date: ".`/bin/date`;

# And write the form data into the file
print FILE $input{"formcontents"}."\n";

close(FILE);
exit(0);

In the program in Listing 9.2, no user data is passed to be reinterpreted by an operating system command or any other program. It is simply written to a file for examination by a "safe" file browser later. The user data never contaminates any operating system command or operation.
CGI programming languages that permit the reinterpretation of variables as program code, such as scripting languages and command interpreters with an eval function, pose the extra problem of user data potentially contaminating the CGI program itself. Care should be taken to avoid passing unchecked user data to any interpreter, explicitly or implicitly.
Some programming languages include features to make the tracking of unchecked or "contaminated" data easier. For instance, the Perl scripting language supports "taint" checking, which helps to identify unchecked data before the program is used. Nevertheless, for most applications, the programmer should attempt to design a clear demarcation between unchecked and validated user data. This might be a variable naming scheme, perhaps where the unchecked data is kept in variables whose names begin with the word "raw" and are transferred upon validation and safety checking to variables beginning with the word "cooked." Alternatively, it might be a logical demarcation in the program's structure where the raw data is available only in the routines that accept the user input and is passed to the rest of the program after rigorous checking.

Check Array Bounds Aggressively

Many interpreted programming languages have inherent limitations in the size of some variable data types. It is also difficult to handle data of an arbitrary size in many compiled languages. For some tools and applications, the programmer will accept the risk of choosing a maximum reasonable size for user-supplied data and might not even check that the user-supplied data is small enough to fit in the storage space set aside for it.
When the application is being made available to anyone and everyone on the Internet, array bounds checking cannot be ignored. It is important that the CGI programmer chooses reasonable limits for the size of the expected input and checks that the programming system being used cannot accommodate those sizes. The programmer must then ensure that any user-supplied data larger than that limit is rejected or ignored. Dynamically allocating as much memory as the user data would fill runs the risk of exhausting the memory of the Web server to the detriment of the Web service. Allowing user-supplied data to over-run a fixed buffer size can cause operating system crashes or can even be exploited to gain unauthorized access to the Web server itself. Recently, crackers have successfully abused poor array bounds checking in Web server software to substitute their own executable program code for the server code in memory.
This is a particular problem if a cracker is able to trick the Web server into delivering the CGI program itself as a Web document rather than its results. The cracker can then "reverse-engineer" the program to determine its weaknesses. Web server software often announces the hardware and operating system platform on which it is running, and Web sites sometimes include this information in Web pages. If a cracker knows what platform the Web server is running under, the cracker can exploit these vulnerabilities more easily.

Never Vary the Path of Execution According to User Data

A CGI gateway is likely to be more secure if it behaves like a pure filter, that is if it does not do different things with different user-supplied data. If there is one normal execution path through the CGI code, it is much easier to track which user data has been validated and which is still "contaminated." The CGI program is simply a filter. If the program takes different execution paths depending on the data supplied, there are many more possibilities to test. In this latter case, the CGI program is behaving like an interpreter, and the canny cracker may be able to construct input that has side effects the programmer could not anticipate due to the complexity of the program.

Avoid Passing User Data to Other Programs

The security vulnerabilities in the program in Listing 9.1 were mainly associated with passing the user supplied data to other programs. To launch the mail application, the CGI gateway implicitly used a command interpreter in the following statement:
open (MAIL, "|/bin/mail -s ' ".$input{"formname"}." ' webweaver");
Part of the form data is included in the command. A cracker could have included any data in the form, including the character sequences necessary to cause the command interpreter to run any command the cracker wishes. Then the rest of the form data is passed as input to the mail application. Also, no allowance is made for the possibility that some input to the mail application could cause arbitrary commands to be launched.
The simplest way to avoid this kind of security vulnerability is to never pass the user data to any other programs. The CGI program in Listing 9.2 demonstrates this approach. Rather than using the mail command, the form data is simply logged to a file. A CGI programmer who is accustomed to the toolkit approach of calling many other utilities as modules in a program must either design a simpler self-contained pure filter or learn what the various utilities do when given any arbitrary input.

Clean Up User Data before Passing It to Other Programs

If the CGI gateway simply must pass the user-supplied data onto some other program, the gateway should first rewrite the dangerous characters in the data to prevent any undesirable side-effects. The programmer must choose a set of characters or an input language that will always have the expected effect in the auxiliary program and then force the user-supplied data into this form. However, in doing so the programmer must not introduce any extra security problems by reinterpreting the user data in the current program. Command scripting languages pose a particular problem here, as it is difficult to refer to the raw CGI environment variables without reinterpreting them in the context of the scripting language. Listing 9.3 is an example IMAGE MAP script that demonstrates the problem.

Listing 9.3. An insecure IMG ISMAP handler.
#!/bin/sh
# Clicking on the map.gif image sends the pixel coordinates as x,y
# in the QUERY_STRING environment variable

# Check for valid coordinates

if echo $QUERY_STRING | egrep '^[0-9][0-9]*, [0-9][0-9]*$' >/dev/null
then
# Send a magnified portion of the image
  echo "Content-type: image/gif"
  echo ""
  zoom $QUERY_STRING map.gif
else
# Send an error message
  echo "Content-type: text/html"
  echo ""
  echo "Picture Zoom Error: Invalid pixel coordinates passed"
fi

Observe in Listing 9.3 that the user query in the QUERY_STRING environment variable is expanded as part of the command
echo $QUERY_STRING
and could have undesirable side-effects if it contained special characters.
A safer implementation would be the script in Listing 9.4:

Listing 9.4. A more secure IMG ISMAP handler.
#!/bin/sh
# Clicking on the map.gif image sends the pixel coordinates as x,y
# in the QUERY_STRING environment variable

# Check for valid coordinates

if /bin/env >/dev/null 2>&1 && /bin/env | /bin/egrep '^QUERY_STRING=[0-9][0-9]*, Â[0-9][0-9]*$' >/dev/null 2>&1
then
# Send a magnified portion of the image
  /bin/echo "Content-type: image/gif"
  /bin/echo ""
  /usr/local/bin/zoom $QUERY_STRING map.gif
else
# Send an error message
  /bin/echo "Content-type: text/html"
  /bin/echo ""
  /bin/echo "Picture Zoom Error: Invalid pixel coordinates passed"
fi

The program in Listing 9.4 does not use the environment variable in a command until it has been safely checked by parsing the output of a command /bin/env, which dumps the whole set of environment variables without reinterpreting them. The first invocation of /bin/env is to ensure that the command does not find any unexpected problems with the environment variables such as unsupported variable names. The second invocation passes the user supplied data to a format checker without passing it through the command interpreter. This technique is not completely safe. It assumes that the /bin/env command will terminate with an error if any environment variable contains an "end-of-line" character or some other control code. However, not all systems have this capability.
Writing code that checks for individual dangerous input characters on a case-by-case basis is difficult to maintain and test. Writing a general reusable validator is a good investment. Something like the C procedure in Listing 9.5 can be used again and again. It takes as its arguments two pointers to null-terminated character strings and returns the first pointer with its contents rewritten to remove any characters not in the second string.

Listing 9.5. Stripping unwanted characters in C.
#ifndef MAX_UchAR
# define MAX_UchAR (255)
#endif

typedef unsigned char uchar;

char *stripchrs(char *string, const char *chrs) {
  char acceptable[MAX_UchAR], *chr, *pos;
  int chrnum;

/* Build a 256 entry table of flags for whether a particular character */
/* is acceptable or not. */
  for (chrnum=0; chrnum< MAX_UchAR; chrnum++) acceptable[chrnum]=0;
  for (chr=chrs; chr && *chr; chr++) acceptable[(uchar)*chr]=1;
/* Step through the string copying only acceptable characters */
  for (chr=string, pos=string; chr && *chr; chr++) {
    *pos=*chr;
     pos+=acceptable[(uchar)*chr];
  }
  *pos='\0';

  return(string);
}

Never Highlight Security Holes within the Program

Even if you are aware of a potential security vulnerability in a program, do not annotate the program with a comment describing the security hole. Many Web servers can be fooled into delivering the CGI program itself as a Web document instead of running it. A comment in the code is a gift to the potential cracker.

Think Like a Cracker-Try to Find Holes in Your Own Software

When writing your CGI program, follow the paths that the user-supplied data takes through the program and check that no user-supplied data influences the running of the server until it has been rendered harmless.
When testing your CGI program, try to think of ways to break the program. Send it garbage input, input that contains special characters that attempt to execute commands on the server, input that is much longer than usual, empty input, and even random input. Check what happens if two instances of your CGI program run in parallel.

Summary

The main things to remember from this chapter are as follows:
  • CGI is not itself insecure, but it is easy to make CGI programs insecure
  • Security is the joint responsibility of both programmer and administrator
  • Use well-respected server and CGI software
  • Restrict access to CGI service to trusted network hosts
  • Restrict access to CGI functionality to trusted users, locally and remotely
  • Examine CGI code, especially freely available CGI packages
  • Run CGI programs in a protected, unprivileged environment with low scheduling priority
  • Run CGI on a "sacrificial" machine, outside any firewall
  • Regularly read security mailing-lists and Usenet newsgroups
  • Assume nothing about user input
  • Choose what input to accept, not what to reject
  • Program defensively
  • Beware of reinterpreting or passing user-supplied data unchecked to other programs
  • Check array bounds
  • Never vary the path of execution according to user data
  • Clean up user data before passing it to other programs
  • Never highlight security holes within the program
And most importantly:
  • Be paranoid!