Apache Server Survival Guide

Chapter 13: Web Accounting




Soon after your Web site is up and running, you will start getting requests for Web traffic statistics. Providing this accounting information from your logs, or "Web accounting," will be one of the primary services you'll be involved with from then on.

The importance and relevance of any Web accounting information depends on what you are going to do with it. If you are not going to do anything with this information, then don't even enable it!

However, Web accounting allows you to create a database of information that you can use in many aspects of your Web site administration. Even when deciding to implement a browser-specific feature, you may be interested in knowing how many visitors will be able to take advantage of the new feature. Logging provides the answers to this question and many others.

All this information can be very useful to you as an administrator of the site and to the people responsible for the content. This accounting information can provide immediate feedback as to how your site is being accepted by the Internet community. If the service is useful, you will want to know which sections are more attractive to visitors and which areas need improvement. This information gives you the opportunity to modify and tweak your site to make it more responsive to your visitors' needs.

Sites that thrive on traffic will also be very interested in the traffic patterns because the cost of advertising could be rated according to the access patterns for a page. Instead of having a flat rate, you could develop a random ad banner. Ads could be targeted to match the profile of your visitors. The type of ad that appears can be dependent on factors such as the time of day, where the visitor is coming from, and so on. The possibilities are many. Chapter 5, "CGI (Common Gateway Interface) Programming," develops a program that you can use for implementing random banners.

Over time, your traffic information will grow to provide sufficient statistical information that depicts interesting patterns such as the resources most frequently requested, the peak times and days for server access, and the way that people travel from page to page.

As an administrator, your main interests will probably be centered around the overall traffic generated in terms of transfer rates. How much is requested will impact the overall performance of your server and network. The monitoring of the error logs should also be important. The error logs will provide you with information about broken links, security violation attempts, and problems related to your CGI programs. If you decide to log, you'll also have to deal with the physical management of the log files. They grow, and they grow fast!

Monitoring logs is an important task because it provides you with vital information and also acts as an indicator of the proper operation of your site.

Monitoring your site's traffic can be accomplished with many tools, some of which are already built into your system. This chapter explores different ways that you can sift through the log data and summarize the results into useful information.

Apache Standard Access Logs (Common Logfile Format)


The Apache server provides several logging modules that will help you keep track of many things. The standard logging module, called mod_log_common, is built into Apache by default. This module logs requests using the Common Logfile Format (CLF).

Starting with Apache 1.2, the default logging module will be mod_log_config, a fully configurable module. mod_log_config is explained in "The mod_log_config Module" section later in this chapter.

The CLF is used by all major Web servers, including Apache. This is a good thing, because it means that you'll be able to run several log analysis tools that are both freely and commercially available for this purpose.

The CLF lists each request on a separate line. A line is composed of several fields separated by spaces. Fields for which information could not be obtained contain a dash character (-). Each log entry uses the following format:

host ident authuser date request status bytes

Fields Available in the CLF


Here's a list of the data each field contains:
host The host field contains the fully qualified domain name of the machine that made the request, or its IP address if the name was not available.


From a performance standpoint, you should not force your server to perform a reverse Domain Name System (DNS) lookup of the client. Some of the logging tools I'll describe can perform this reverse lookup at the time you create your reports. Apache 1.1.1 ships with a little support utility called logresolve, which will obtain this information from the IP address stored in the log.

ident If IdentityCheck is enabled and the client machine was running an identity daemon, the ident field will contain the name of the user that made the request. You should never trust this information unless you know that the host is trusted. Otherwise, this information can be spoofed and is not trustworthy, so don't bother enabling it!
authuser If the request required authentication, the authuser field will contain the login of the user who made the request.
date The date field contains the date and time of the request, including the offset from Greenwich Mean Time. The date format used is day/month/year:hour:minute:second timezone
request The request field is set to the actual request received from the client. It is enclosed in double quotes (").
status This field contains the three-digit HTTP status code returned to the client. Apache can return any of the following HTTP response codes:

200: OK

302: Found

304: Not Modified

400: Bad Request

401: Unauthorized

403: Forbidden

404: Not Found

500: Server Error

501: Not Implemented

502: Bad Gateway

503: Out Of Resources (Service Unavailable)

The HTTP standard defines many other codes, so this list is likely to grow as new features are implemented in Apache.
bytes The size of the transfer in bytes returned to the client, not counting any header information.
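To make the field layout concrete, here is a minimal shell sketch that pulls a few fields out of each CLF entry read from standard input. The function name clf_summary is my own invention, not part of Apache, and the sketch assumes that the only double quotes on a line are the pair around the request field.

```shell
# Hypothetical helper: summarize CLF entries read from stdin.
# Assumes the request is the only double-quoted field on each line.
clf_summary() {
    awk -F'"' '
        {
            split($1, head, " ")   # head[1]=host, head[2]=ident, head[3]=authuser, then the date
            split($3, rest, " ")   # rest[1]=status, rest[2]=bytes
            printf "host=%s request=%s status=%s bytes=%s\n", head[1], $2, rest[1], rest[2]
        }
    '
}
```

You could feed it a log with something like clf_summary < logs/transfer_log, which prints one summary line per request. Splitting on the double quote rather than on spaces keeps the request together even though it contains spaces itself.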

Enabling Logging


To enable logging using the standard log format, use the TransferLog directive. This directive allows you to specify the filename to receive the logging information. Instead of a file, you can also specify a program to receive the information on its Standard Input stream (stdin).

The syntax of the TransferLog directive is as follows:
Syntax: TransferLog [filename] | [|program]
Default: TransferLog logs/transfer_log

filename is the name of a file relative to ServerRoot. If for some reason you don't want to log, specify /dev/null as the access log file.

|program is the pipe symbol (|) followed by a path to a program capable of receiving the log information on stdin.

As with any program started by the server, the program is run with the User ID (UID) and Group ID (GID) of the user that started the httpd daemon. If the user starting the program is root, be sure that the User directive demotes the server privileges to those of an unprivileged user such as nobody. Also, make sure the program is secure.

Here's a sample from an access log file generated by Apache for http://www.PlanetEarthInc.COM, a site hosted at accessLINK:

sundmz1.bloomberg.com - - [20/Jul/1996:09:56:03 -0500] "GET /two.gif HTTP/1.0" 200 2563
sundmz1.bloomberg.com - - [20/Jul/1996:09:56:03 -0500] "GET /three.gif HTTP/1.0" 200 4078
sundmz1.bloomberg.com - - [20/Jul/1996:09:56:03 -0500] "GET /four.gif HTTP/1.0" 200 4090
pn3-ppp-109.primary.net - - [20/Jul/1996:09:57:29 -0500] "GET / HTTP/1.0" 200 5441
pn3-ppp-109.primary.net - - [20/Jul/1996:09:57:36 -0500] "GET /images/ultimate.gif HTTP/1.0" 200 7897
pn3-ppp-109.primary.net - - [20/Jul/1996:09:57:38 -0500] "GET /sponsors/banner-bin/emusic2.gif HTTP/1.0" 200 8977
pn3-ppp-109.primary.net - - [20/Jul/1996:09:57:44 -0500] "GET /images/hero.gif HTTP/1.0" 200 16098
128.58.101.231 - - [20/Jul/1996:09:59:19 -0500] "GET / HTTP/1.0" 200 5441
128.58.101.231 - - [20/Jul/1996:09:59:23 -0500] "GET / HTTP/1.0" 200 5441
slip-2-28.slip.shore.net - - [20/Jul/1996:10:03:44 -0500] "GET / HTTP/1.0" 200 5439
slip-2-28.slip.shore.net - - [20/Jul/1996:10:04:07 -0500] "GET /sponsors/banner-bin/books.gif HTTP/1.0" 200 5726
slip-2-28.slip.shore.net - - [20/Jul/1996:10:04:09 -0500] "GET /images/ultimate.gif HTTP/1.0" 200 7897
slip-2-28.slip.shore.net - - [20/Jul/1996:10:04:16 -0500] "GET /images/hero.gif HTTP/1.0" 200 16098
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:09:38 -0500] "GET / HTTP/1.0" 200 5441
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:09:50 -0500] "GET /anim.class HTTP/1.0" 200 12744
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:00 -0500] "GET /one.gif HTTP/1.0" 404 -
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:01 -0500] "GET /two.gif HTTP/1.0" 200 2563
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:05 -0500] "GET /three.gif HTTP/1.0" 200 4078
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:09 -0500] "GET /four.gif HTTP/1.0" 200 4090
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:12 -0500] "GET /five.gif HTTP/1.0" 200 3343
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:15 -0500] "GET /six.gif HTTP/1.0" 200 2122
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:18 -0500] "GET /seven.gif HTTP/1.0" 200 2244
slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:11:06 -0500] "GET /eight.gif HTTP/1.0" 200 2334
www-j8.proxy.aol.com - - [20/Jul/1996:10:31:50 -0500] "GET / HTTP/1.0" 200 5443
www-j8.proxy.aol.com - - [20/Jul/1996:10:31:57 -0500] "GET /images/ultimate.gif HTTP/1.0" 200 7897
www-j8.proxy.aol.com - - [20/Jul/1996:10:31:57 -0500] "GET /images/hero.gif HTTP/1.0" 200 16098
www-j8.proxy.aol.com - - [20/Jul/1996:10:31:57 -0500] "GET /sponsors/banner-bin/ktravel.gif HTTP/1.0" 200 1500
sage.wt.com.au - - [20/Jul/1996:10:43:05 -0500] "GET / HTTP/1.0" 200 5441

By simple inspection of this log excerpt, you can see that most requests are answered successfully. Only one entry is suspicious:

slip-12-16.ots.utexas.edu - - [20/Jul/1996:10:10:00 -0500] "GET /one.gif HTTP/1.0" 404 -

It has a response code of 404, "Not Found." The person maintaining this site should check whether this error is repeated elsewhere, because one of the site's pages could contain a broken link.
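Scanning by eye stops being practical once the log grows. As a rough sketch, the following function tallies entries by status code so that unexpected codes stand out. The name status_tally is hypothetical, and the code assumes the request is the only double-quoted field on each line.

```shell
# Hypothetical helper: count log entries per HTTP status code.
# $1 = path to a log file in CLF (or a CLF-compatible format).
status_tally() {
    awk -F'"' '
        {
            split($3, rest, " ")   # text after the closing quote: status, then bytes
            count[rest[1]]++
        }
        END { for (code in count) print code, count[code] }
    ' "$1" | sort -n
}
```

Run against the excerpt above, it would report 27 entries with status 200 and one with status 404.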

In addition to the standard mod_log_common logging module, Apache provides a logging module that is fully customizable. This module is still considered experimental as of release 1.1, but according to some sources, it will be the preferred and default logging module for Apache 1.2. Even in its "experimental" state (it is actually just as reliable as the other one), its flexible log format may provide you with more useful logging capabilities and may let you consolidate several logs into a single one.

Additional Logging Modules


While the default logging agent, mod_log_common, will be more than adequate for most needs, other logging modules may be useful to you. As of version 1.1.1, Apache included the following modules, which added logging capability:

mod_log_config

mod_cookies

mod_log_agent

mod_log_referer

Of these four modules, the most important ones are mod_log_config and mod_cookies. mod_log_config is a configurable module that provides much more flexibility when logging information. mod_cookies also provides a log, but more importantly it enables the automatic generation of cookies from within Apache. (These cookies should not be confused with Netscape persistent cookies. For information on Netscape persistent cookies, refer to Chapter 7, "Third-Party Modules.")

A cookie is a unique identifier that the server hands to a browser when it makes its initial connection. Because this identifier is guaranteed to be unique, you can use it to follow a user navigating through your Web site.

mod_log_agent and mod_log_referer are compatibility modules for users of the NCSA 1.4 Web server who are migrating to Apache. mod_log_agent logs the client (browser) that was used to access the resource. mod_log_referer logs the page the user is coming from. This last piece of information helps you determine where users are coming from and where your site is referenced. The referer module is also useful for tracking stale links that refer to resources that have since moved or been removed from your site.

mod_log_config allows logging of the same information provided by mod_log_agent and mod_log_referer, except that instead of being logged into several different files, the information can be consolidated into one log. mod_log_config will help you reduce the complexity of the log analysis scripts you develop, while still producing logging information that is compatible with CLF.

Apache 1.2 will introduce new logging capabilities, including the ability to redirect errors from your CGI to a log file, an enhanced log module based on user-tracking cookies, and the ability to have multiple configurable log files.

The mod_log_config Module


The mod_log_config module is not built into Apache by default. To use any of its directives, you need to reconfigure Apache to include this module, comment out the standard mod_log_common from the module list, and recompile the server.

This module allows the logging of server requests using a user-specified format. It is considered experimental; however, in my experience it works great. As previously mentioned, this will be the default logging module for Apache 1.2 and beyond. mod_log_config implements the TransferLog directive (same as the common log module) and an additional directive, LogFormat. Because both the mod_log_common and mod_log_config implement the TransferLog directive, I would consider it wise to only compile one or the other into Apache. Otherwise, your results may be unexpected.

The log format is flexible; you can specify it with the LogFormat directive. The argument to the LogFormat is a string, which can include literal characters copied into the log files and percent sign (%) directives like the following:
%h Remote host.
%l Remote logname (from identd, if supplied).
%u Remote user (from auth; may be bogus if return status (%s) is 401).
%t Time of the request using the time format used by the Common Log Format.
%r First line of request.
%s Status. For requests that got internally redirected, this is the status of the original request; %>s for the last.
%b Bytes sent.
%{Header}i The contents of Header: header line(s) in the request sent to the server.
%{Header}o The contents of Header: header line(s) in the reply.

One of the better features this module provides is conditional logging: a field can be included or omitted depending on the HTTP response code. You specify the condition by placing the HTTP status code between the % and the letter code for the field. You may list more than one HTTP status code by separating them with a comma (,). In addition, you can log any of the request headers received by the server, such as User-Agent or Referer, by specifying the header name between braces ({Header}). Here are a few examples:

%400,500{User-agent}i

The preceding example logs User-agent headers only on Bad Request (400) or Server Error (500) responses.

You can also specify that a field be logged only if certain HTTP codes are not returned, by adding an exclamation symbol (!) in front of the codes you want to check for:

%!200,304,302{Referer}i

This example logs the Referer header information on all requests not returning a normal return code. When a condition is not met, the field is null. As with the common log format, a null field is indicated by a dash (-) character.

Virtual hosts can have their own LogFormat and/or TransferLog. If no LogFormat is specified, it is inherited from the main server process. If the virtual hosts don't have their own TransferLog, entries are written to the main server's log. To differentiate between virtual hosts writing to a common log file, you can prepend a label to the log format string:

<VirtualHost xxx.com>
LogFormat "xxx formatstring"
...
</VirtualHost>
<VirtualHost yyy.com>
LogFormat "yyy formatstring"
...
</VirtualHost>

LogFormat


The format of the log is specified as a string of characters:
Syntax: LogFormat string
Default: LogFormat "%h %l %u %t \"%r\" %s %b" (same as the CLF)

You are free to specify the fields in any order you want. But for compatibility with the CLF, you may want to observe the order of the standard elements (as in the CLF specification):

host ident authuser date request status bytes

I like the following format, which provides a lot of useful information:

"%h %l %u %t \"%r\" %s %b %{Cookie}i %{User-agent}i %400,401,403,404{Referer}i"

To make the Cookie header available, I compiled in the mod_cookies module. I also disabled the CookieLog by pointing it to /dev/null. There is no need to have a separate cookie log when you can include this information in the main log.

To enable logging, you need to use the TransferLog directive:
Syntax: TransferLog [filename] | [|program]
Default: TransferLog logs/transfer_log

filename is the name of a file relative to ServerRoot.

|program is the pipe symbol (|) followed by a path to a program capable of receiving the log information on stdin.

As with any program started by the server, the program is run with the UID and GID of the user that started the httpd daemon. If the user starting the program is root, be sure that the User directive demotes the server privileges to those of an unprivileged user such as nobody. Also, make sure the program is secure.

The Error Log


In addition to the transfer (or access) logs, you'll want to keep a close watch on the error log. The location of the error log is defined with the ErrorLog directive, which defaults to logs/error_log. The format of this log is rather simple: it lists the date and time of the error, along with a message. Usually you'll want to look for messages that report a failed access because that could mean that there is a broken link somewhere.

If you are debugging CGI, you will want to be aware that information sent by a CGI to the standard error stream (stderr) is logged to the error log file, which makes the contents of this file invaluable while debugging your CGI.

Apache 1.2 introduces the ScriptLog directive, which will send all stderr messages to the log file specified with it. However, at the time of this writing, I could not obtain additional information to fully document this directive. Please check the Apache site for the latest information on Apache 1.2.

If you want to keep a watch on any of your log files as the entries are added, you can use the UNIX tail command. The tail command delivers the last part of a file. tail has an option that makes it keep listening for new text appended to the file. You enable this behavior with the -f switch:

tail -f /usr/local/etc/httpd/logs/error_log

This will display any error entries as they happen. You can also use this command on the transfer log and have up-to-the-second information regarding any activity on your Web server. (For an even better activity report, take a look at Chapter 11, "Basic System Administration," for information on the Status module.)

Searching and Gathering


Now that your logs are accumulating data, you may want to be able to search them quickly. UNIX comes with a wide range of tools that can easily search a large file for a pattern. The examples here use a richer log file: a CLF-compatible format with extra information at the end of each entry. I used the mod_log_config module and specified a log format of

LogFormat "%h %l %u %t \"%r\" %s %b %{Cookie}i %{User-agent}i %400,401,403,404{Referer}i"

This log format adds the Cookie header (a number) associated with each request. It also logs the browser the visitor was using and the Referer header information if the request was bad.

This format allows you to pack a lot of useful information into a single log file while still remaining compatible with most, if not all, of the standard logging tools available. (The order of the first seven fields is the same as the CLF.)
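Because the Referer field is filled in only for failed requests in this format, the last field of each line doubles as a broken-link report. Here is a minimal sketch, assuming the Referer is always the final space-separated field and the request has the usual method, path, and protocol parts. The function name failed_referers is hypothetical.

```shell
# Hypothetical helper: list "referring page -> requested path" pairs for
# failed requests, most frequent first.
# $1 = log file written with the extended LogFormat shown above.
failed_referers() {
    awk '
        # $NF is the last field (Referer, or "-" when not logged); $7 is the requested path.
        $NF != "-" { seen[$NF " -> " $7]++ }
        END { for (pair in seen) print seen[pair], pair }
    ' "$1" | sort -rn
}
```

Using $NF instead of a fixed field number matters here because the User-agent value can itself contain spaces, which shifts the position of everything after it.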

Counting the Unique Number of Visitors


If you are interested in counting the number of visitors and you are logging the cookie information as in our example log format, it becomes a matter of just counting the number of unique cookies in our log file. On entering the site, each visitor is assigned a unique cookie by Apache. A cookie looks like the following:

Apache=##################

Each # character represents a number. Counting users becomes a matter of counting unique cookies. The following series of commands retrieves this information:

awk '{print $11}' logfile | sort | uniq | wc -l

The awk command prints the eleventh field of each line. Fields in the log file are separated by spaces, so each space starts a new field. The output, which contains only the cookie values, is piped to sort, which orders them so that identical cookies end up on adjacent lines. The sorted output is piped to uniq, which removes duplicate lines. Finally, the thinned-out list is sent to wc, which counts the number of lines in the result. This number matches the number of unique visitors that came to your site. For more information on these commands, please consult your UNIX documentation.
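One wrinkle with the pipeline above is that requests arriving without a cookie are logged with a dash in that field, and the dash is counted as one extra "visitor." The following variant is a sketch that counts only values that look like the Apache=... cookie described earlier. The function name unique_visitors is my own, and it assumes the cookie is still the eleventh field.

```shell
# Hypothetical helper: count distinct Apache=... cookies in field 11.
# $1 = log file written with the extended LogFormat shown above.
unique_visitors() {
    awk '$11 ~ /^Apache=/ { print $11 }' "$1" | sort -u | wc -l | tr -d ' '
}
```

sort -u combines the sort and uniq steps, and the final tr strips the padding that some versions of wc add around the number.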

Using grep to Determine the Origin of the User


Another tool that is very useful for extracting information from your logs is the grep program.

By issuing the simple command

grep ibm.com access_log

you can see all the requests that originated from the ibm.com domain. If you just want a count of the accesses that came from the ibm.com domain, add the -c flag to the command:

grep -c ibm.com access_log
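A plain grep matches the pattern anywhere on the line, so it also counts entries where ibm.com happens to appear in a request or a Referer. As a rough sketch, the function below checks only the host field and also totals the bytes sent to that domain. The name domain_traffic is hypothetical, and the code assumes the bytes value is the tenth space-separated field, as it is for ordinary three-part requests.

```shell
# Hypothetical helper: print "hits bytes" for hosts in a given domain.
# $1 = domain such as ibm.com; $2 = log file in CLF or a CLF-compatible format.
domain_traffic() {
    awk -v dom="$1" '
        {
            n = length($1) - length(dom)
            # Accept the bare domain or any host name ending in ".domain".
            if ($1 == dom || (n > 0 && substr($1, n) == ("." dom))) {
                hits++
                if ($10 != "-") bytes += $10   # "-" means no size was logged
            }
        }
        END { print hits + 0, bytes + 0 }
    ' "$2"
}
```

For example, domain_traffic ibm.com access_log prints two numbers: the number of requests from that domain and the total bytes returned to it.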

Daily Statistics


If you want a quick count of all the hits that your site sustained on a certain date, say on July 19, 1996, type the following:

grep -c 19/Jul/1996 access_log

To count all hits that your site sustained on July 19, 1996, between 3:00 and 3:59 p.m. (hour 15, because log times are expressed in 24-hour format), type the following:

grep -c "19/Jul/1996:15" access_log
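Rather than running one grep per hour, you can get the whole day's breakdown in a single pass. This is a minimal sketch, assuming the date is the fourth space-separated field and starts with an opening bracket, as in CLF. The function name hits_per_hour is my own.

```shell
# Hypothetical helper: print "hour hits" for every hour of one day.
# $1 = date as it appears in the log, e.g. 19/Jul/1996; $2 = log file.
hits_per_hour() {
    awk -v day="$1" '
        {
            # $4 looks like [19/Jul/1996:15:02:11 ; drop the "[" and split on ":".
            split(substr($4, 2), t, ":")
            if (t[1] == day) hits[t[2]]++
        }
        END { for (h in hits) print h, hits[h] }
    ' "$2" | sort
}
```

Calling hits_per_hour 19/Jul/1996 access_log prints one line per hour that had traffic, which makes the peak hours easy to spot.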

Home Page Statistics


To count all accesses to your home page, type the following:

grep -c "GET / " access_log

or

grep -c "GET /index.html " access_log

or

grep -c "/~username" access_log

The sum of the first two searches is the total number of accesses to your home page, assuming that your home page is at the root directory and is named index.html. For private home pages, use the third option; just replace username with the login of the user.
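If you would rather not add the two counts by hand, a single awk pass can do it. This is only a sketch, under the same assumption that the home page is reached as either / or /index.html. The function name home_page_hits is hypothetical.

```shell
# Hypothetical helper: count GET requests for / or /index.html.
# $1 = log file in CLF or a CLF-compatible format.
home_page_hits() {
    awk '/"GET \/ / || /"GET \/index\.html / { n++ } END { print n + 0 }' "$1"
}
```

The trailing space in each pattern matters: without it, the first pattern would match every request whose path begins with a slash.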

Searching the Error Log


You should check your error logs frequently. Of special interest are the messages "user not found" and "password mismatch." If you get many repeated lines with these errors, someone may be trying to break into your site.
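A quick count of these messages tells you whether a closer look is needed. The sketch below assumes the error log contains the text "password mismatch" and the word "user" followed later on the same line by "not found". The exact wording may differ between Apache versions, so check a few lines of your own error log first. The function name auth_failures is hypothetical.

```shell
# Hypothetical helper: count authentication failures in an error log.
# The message patterns are assumptions; adjust them to match your own log.
# $1 = path to the error log.
auth_failures() {
    awk '/user .*not found/ || /password mismatch/ { n++ } END { print n + 0 }' "$1"
}
```

For example, auth_failures logs/error_log prints a single number. A sudden jump from one day to the next is the kind of pattern worth investigating.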

Tools for Summarizing Information


While the command line is invaluable for creating on-demand reports that search for something very specific, many tools are available that create nice reports summarizing almost everything you want to know about your site's traffic. Many of these tools are free, and they all answer, with varying degrees of success, the basic questions about your site's traffic.

Your choice will depend on what type of output you like. These tools are available in two types: graphical and text. You can find free, shareware, and expensive versions of these tools. Shop closely, and look on the Net for the latest on these tools. A good place to search is

http://www.yahoo.com/Computers_and_Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/

The higher-end tools, such as net.Analysis from net.Genesis (http://www.netgen.com), cost anywhere from $295 to $2,995. They provide a number of features that may be interesting to very high-traffic sites.

In the inexpensive range (less than $100), there are many nice tools with tons of options available. My favorite tools are described in the following sections.

AccessWatch


AccessWatch by Dave Maher, http://netpressence.com/accesswatch/, is a graphically appealing log analyzer. It provides information about today's Web access. The software is implemented as a Perl script, so it is portable to environments that run Perl. This software is free for the U.S. Government, noncommercial home use, and academic use. Any other use has a license of U.S. $40/year. Figures 13.1–13.5 show examples of AccessWatch reports.

Figure 13.1. AccessWatch daily access and predictions report.

Figure 13.2. AccessWatch summary report.

Figure 13.3. AccessWatch hourly access report.

Figure 13.4. AccessWatch page access report.

Figure 13.5. AccessWatch domain access report.

Wusage


Wusage by Thomas Boutell, http://www.boutell.com, is a powerful and appealing log analyzer that is distributed in binary form. There are versions for various types of UNIX and Windows. Wusage provides all the configuration commands you could possibly want. As a bonus, it comes with a friendly utility for configuring the program. It uses Boutell's gd Graphics Interchange Format (GIF) library to generate a variety of attractive charts. Its number one feature is that it lets you create reports based on a pattern, be it a filename or a site. It can also analyze multiple log files at once. The license price varies depending on the number of copies you purchase and the intended use. Single-user licenses are $25 for educational and nonprofit institutions and $75 for all others. Various Wusage reports are shown in Figures 13.6–13.10.

Figure 13.6. Wusage daily access report.

Figure 13.7. Wusage monthly access report.

Figure 13.8. Wusage hourly graph report.

Figure 13.9. Wusage domain access report.

Figure 13.10. Wusage top 10 document requests.

Analog


Analog by Stephen Turner of the Statistical Laboratory, University of Cambridge (http://www.statslab.cam.ac.uk/~sret1/analog), is highly configurable and is fully HTML 2.0 compliant. It does have some graphics options, but I like the text reports, as shown in Figures 13.11–13.14. They are very comprehensive. Analog runs under UNIX, Macintosh, and DOS.

Figure 13.11. Analog's summary report.

Figure 13.12. Analog's monthly report and partially showing the daily report.

Figure 13.13. Analog's hourly summary report.

Figure 13.14. Analog's domain report.

wwwstat


Another popular utility is wwwstat, http://www.ics.uci.edu/WebSoft/wwwstat/. This utility, when coupled with gwstat (ftp://dis/cs.umass.edu/pub/gwstat.tar.gz), produces nice graphic reports. However, the graphic system only works under X11 (the X Window System). I also had trouble getting it to process my 40MB log file on a very fast PA-RISC server, which leads me to believe that there was a compatibility problem on my side. The other software packages previously mentioned processed the 40MB log file in a minute or so without any problems.

Summary


Logging information provides you with interesting statistical information about visitors to your site. By carefully analyzing this data, you may be able to catch a glimpse of a trend. This information can be invaluable in terms of determining what visitors like to see in your site as well as pointing out the content you should enhance. Responding to these perceived needs will help you build a better site, one that is useful to your visitors and one that will attract more of them.

Using software tools to graph Web accounting information is better than looking at raw numbers. "A picture is worth a thousand words," and in this case, it will help you understand your site's traffic statistics better. Many of these reporting tools summarize information by time periods and access groups, which lets you see your information in several different ways. By associating this data with press releases and other media-dissemination information, you can gauge customer interest in your products or services. Your traffic is your feedback.