The Concrete Architecture of the Apache Web Server (reposted)

http://www.cs.ucsb.edu/~tve/cs290i-sp01/papers/Concrete_Apache_Arch.htm

The Concrete Architecture of the Apache Web Server

Octavian Andrei Dragoi,
oadragoi@neumann.uwaterloo.ca

Jean Elizabeth Preston,
je2prest@neumann.uwaterloo.ca

Department of Computer Science, University of Waterloo,
Assignment 2 for CS746G
February 9, 1999
Abstract:

This report gives a tour of the concrete architecture of the Apache web server (release 1.3.4). The goal is to provide support for anyone who wants to modify a subsystem, or add extra functionality.
The main components of the concrete architecture of the Apache server are the core and the modules.
This paper covers the details of Apache core architecture, the essential data structures with their uses, and gives an extended insight into the concrete architecture of a module. The concurrency approach employed in Apache is also detailed.
In general, anyone who wants to add extra functionality only has to write a new module. This usually means providing one or more handlers (functions) for one of the phases of processing an HTTP request. In fact, even an important part of the Apache core has the "look and feel" of a module, although it is not a proper one (it shares information with other core sub-components).
The way a module's handlers are called is transparent to the module, and all communication with a module is done through pointers to functions. Because of this, fact extractors cannot capture the interaction between core and modules.
To extract the concrete architecture we have used a variety of sources: fact extractors (Portable Bookshelf), papers on Apache, README files, and, to a large extent, analysis of the relevant parts of the source code.

Keywords:

Apache, concrete architecture, design, web server

Available online at:

http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a2/caa.html

1. Introduction

A webserver provides users of a web a means of retrieving data, usually documents, through web browsers. The user clicks on a link in the browser and a request is sent from the browser to the server. The server retrieves the document from storage and returns it to the browser, which presents it to the user.

The Apache server divides the handling of requests into separate phases. These request phases are:

URI -> filename translation
- determine what information is wanted, and where in storage it is
Auth ID checking
- is the user who they say they are?
Auth access checking
- is the user authorized here?
Other access checking
- access checking other than authorization
Determining MIME type of requested object
- information can be in many different forms, including text, images, sound, or video. We will use the term "document" for any type of information available on the web.
Sending response to client
- what is returned to the client, or what action the server should take, is determined by the method specified in the request
Logging the request
(Apache Group)

In Apache, each phase is handled by a module or set of modules. Each module is looked at in succession, to see if it has a handler for the phase. This results in a flow of control and data that is similar to a pipeline.

Figure 1. Apache pipeline of request handling phases

The above figure illustrates the conceptual movement of the data structure request_rec and the flow of control with the broken arrows. The process starts and ends in the core, where request_rec is created and where the cleanup is done after the request has been handled.

Actually, control moves from the core to each phase and then back to the core, as is shown by the solid arrow lines. As well, request_rec is first created by the core, then passed to each phase and back to the core in turn.

2. High level concrete architecture of Apache

The high level concrete architecture of the Apache web server is not very different from the high level conceptual architecture, in the sense that we still find the same splitting of functionality between the core and the modules. In addition to the Apache core and the modules we can identify three new, small components. Two of them (ap, regex) are essentially libraries of utility functions, used by both the core and the modules. The third one (os) ensures that the Apache core and the standard modules are independent of the operating system.

Figure 2. High level concrete architecture of Apache.

The next section presents in more detail each component of Apache.

2.1. Overview of Apache components

The directory structure of the Apache source code reflects a similar partition of the code into separate components. Almost all include files are grouped in a separate include/ directory, although in fact they declare only the functions implemented by the Apache core (main/ subdirectory).

It is worth noting that all Apache functions, utility functions, wrappers and re-implementations are prefixed by ap_. This rule was introduced in release 1.3 of Apache in order to avoid the name conflicts that were possible with the unprefixed names of the 1.2 release. However, there are header files that map the old names to the new ones, so modules written for the 1.2 release can be adapted easily.

main is the Apache core, which implements the basic functions of a web server.
modules is a component containing the different modules that are shipped with the Apache distribution. This includes a set of standard modules that extend and complement the Apache core.
os is a component that encapsulates the functions strictly dependent on the operating system and platform. In the source code tree the os/ directory contains a directory for each supported platform. Platform is used here in the sense of a software platform that offers a common programming environment. For example, unix/ contains the files specific to Unix platforms. Mainframes running proprietary operating systems that use EBCDIC character encoding instead of ASCII have their own directories (bs2000, tpf). OS/2 from IBM and Windows NT are also supported (os2, win32). The "link" between the platform independent components and the platform dependent os component is the file os.h. It is included in all Apache source code and declares the functions specific to that system. Such functions are, for instance, UNIX functions not available in a Windows environment. The implementation of such functions goes in os.c for regular functions and in os-inline.c if they can be in-lined. Other functions and code strictly related to a platform should also go in the directory for that system. For example, the Windows code for storing configuration files in the registry (a repository of configuration information used by all applications in this operating system) is also part of the os component for Windows (win32/).
regex is a separate component, used as a library of general functions dealing with regular expression manipulation (e.g. splitting a string into tokens in an awk fashion). It is called from the main component (alloc.c, http_core.c, http_request.c, util.c, util_uri.c).
ap is a component that defines wrappers for functions whose behavior is not uniform across platforms (e.g. strncpy, which has different behaviors w.r.t. the trailing '\0' in the copied string), re-implementations of unstable library functions, and new utility functions (e.g. formatting functions for numbers, for Internet addresses, etc.).
The functions grouped in this component do not provide utilities specific to a web server, which is why they form a separate directory (component).
support is a separate component containing shell scripts and source code for helper programs for the Apache server administrator, such as rotating log files to save space, manipulating password files, and generating statistics from the log files. The files in this component are therefore not part of the actual Apache server.
helpers contains shell scripts used as helpers by the compile time configuration routine for Apache.

2.2. Concrete description of the Apache core

All the files forming the Apache core are grouped in the main/ directory. The designers of Apache wanted as much functionality of the web server as possible to be implemented as separate modules and therefore there are many interactions between the sub-components of the Apache core. The idea is that someone extending Apache should not have to modify anything in the core. The only sub-component that might need changing in order to extend the server is the one that implements the HTTP protocol (which is part of the core). Although (on the good side) the HTTP protocol is a separate sub-component of the core, it has no well-defined API.

Figure 3. The concrete architecture of the Apache core.

2.3. The main server loop (`http_main.c`)

The http_main.c file contains code that starts up the server (i.e. the actual main()), the main server loop, code for managing children and code for managing timeouts.

The main server loop is the one that waits for a TCP/IP connection request, accepts it (i.e. establishes a TCP/IP connection), allocates a resource pool, reads the HTTP request from the TCP/IP stream, and calls the appropriate function in the http_request.c file to handle the HTTP request. After the request has been processed, it frees the resource pool and, eventually, closes the TCP/IP connection.

Figure 3 shows the interaction of this file with the other sub-components of the Apache core. This sub-component also controls the number of active child processes, through a special shared data structure (the scoreboard) that holds information on the status of each child. More on this data structure and the way Apache manages concurrency is presented in the data structures section and the section on concurrency.

2.4. The sub-component implementing the basic functionality of a web server (`http_core.c`)

The http_core.c file implements the most basic functions of processing an HTTP request. A comment in the source file describes it as being "just 'barely' functional enough to serve documents, though not terribly well".

This file could almost have been mod_core.c. In fact it defines a module structure, just as any module does. http_core.c defines the command table for all the standard configuration commands. It also implements handlers for self-initialization and for some of the phases of the HTTP request cycle:

a very rudimentary URI to file name translation
"do-nothing" handler for MIME type checking
"do-nothing" handler for fix-ups
a very rudimentary default handler for all the MIME types (only delivery of a document as it is, with no support for all special operations defined by HTTP).

One reason why http_core.c is not a separate module is related to the legacy configuration commands (from the NCSA web server) that Apache must implement. These commands, and in particular Options, are more powerful than the typical Apache configuration commands in the sense that they affect more than one module. In order to implement this kind of behavior http_core.c must have access to some Apache core data structures that are not accessible to ordinary modules.

2.5. The configuration sub-component (`http_config.c`)

This is an important support sub-component in the Apache core. It is responsible for managing all the information related to modules and what they offer. It is also responsible for initializing modules and other sub-components of the Apache core (e.g. http_core.c). This includes parsing configuration files and invoking the appropriate commands in the command table advertised by the modules.

Another major function of http_config.c is walking through the linked list of modules and invoking the appropriate handlers for a phase of the HTTP request handling cycle, when it is asked to do so by the http_request.c sub-component. The rationale behind having this function here and not in http_request.c is that, in a way, http_config.c is the owner of the data structure that holds information on the current modules and does all the bookkeeping related to this structure. It should be noted that http_config.c does not decide when a phase is invoked; it just performs the implicit invocation of the appropriate handlers.
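The walk over the module list can be sketched roughly as follows. This is a minimal illustration and not the code in http_config.c: all sk_ names, the two-field request record, and the fixed file name are assumptions made for the example, and only the translation phase is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes a handler can give back. The names echo the idea of
 * "completed" versus "declined"; the values belong to this sketch. */
enum { SK_OK = 0, SK_DECLINED = -1 };

/* Hypothetical, trimmed-down request record. */
typedef struct sk_request {
    const char *uri;
    const char *filename;
} sk_request;

typedef int (*sk_handler)(sk_request *r);

/* Hypothetical module record: one slot per phase, NULL when the module
 * offers no handler for that phase. */
typedef struct sk_module {
    const char *name;
    sk_handler translate;      /* URI -> filename translation */
    sk_handler type_checker;   /* MIME type determination (unused here) */
    struct sk_module *next;    /* next module in the linked list */
} sk_module;

/* Walk the module list and call each translate handler in turn until
 * one of them reports that it has dealt with the phase. */
int sk_run_translate(sk_module *modules, sk_request *r)
{
    assert(r != NULL);
    for (sk_module *m = modules; m != NULL; m = m->next) {
        if (m->translate == NULL)
            continue;              /* this module skips the phase */
        int rc = m->translate(r);
        if (rc != SK_DECLINED)
            return rc;             /* phase completed, stop walking */
    }
    return SK_DECLINED;            /* no module handled the phase */
}

/* Two example handlers used to exercise the walker. */
int sk_decline_all(sk_request *r)
{
    (void)r;
    return SK_DECLINED;
}

int sk_map_to_docroot(sk_request *r)
{
    r->filename = "/var/www/index.html";  /* fixed path, illustration only */
    return SK_OK;
}
```

The caller decides which phase to run and when; the walker only performs the invocation, which mirrors the split of responsibilities described above.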

2.6. The request flow control sub-component (`http_request.c`)

The responsibility of the actual control of the order in which different handlers are invoked is on http_request.c.

An interesting feature of http_request.c is the possibility of handling sub-requests, which can be viewed as a sort of recursion in the flow of handling an HTTP request (e.g. while handling one phase a module can issue a sub-request to convert an additional URI to a file name). More on sub-requests can be found in the section on data structures.

2.7. The protocol sub-component (`http_protocol.c, http_rfc1413.c`)

This component implements the actual HTTP protocol and manages the connection with the HTTP client. It ensures that the dialog with the client follows the HTTP protocol, converting the information contained in the current request data structure into HTTP headers (text lines that advise the client on the characteristics of the entity delivered) and HTTP content (binary information representing the requested entity). This sub-component also takes care of correctly closing the HTTP connection in case of error, by advising the client of the error using HTTP protocol error codes.

The sub-component is called when the connection is established and closed, and at the appropriate times during the processing cycle of the HTTP request (for example when the document to be delivered must be written to the client).

2.8. Resources management sub-component (`buff.c, alloc.c`) and utilities sub-component (`util*.c`)

The resource management sub-component is formed by the files alloc.c and buff.c. buff.c offers functions for buffered I/O and buffered character conversion, which replace similar library functions whose semantics vary from platform to platform.

alloc.c implements the management of resource pools. A resource pool is a big memory pool which is used to allocate the memory needed to process the current request. File descriptors are also allocated from the resource pool. The advantage of having a unique resource pool for each HTTP request is that memory and file descriptors can be freed at once, when the processing ends. This not only frees the programmer of a module from having to explicitly free each allocated resource, but also prevents resource leakage.
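The idea of releasing everything at once can be illustrated with a very small pool. This is a sketch under simplifying assumptions, not the allocator in alloc.c: the sk_ names are invented, each allocation is a separate malloc call, and only memory (not file descriptors) is tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation. The max_align_t member
 * keeps the memory that follows the header suitably aligned. */
typedef union sk_block {
    union sk_block *next;
    max_align_t align;
} sk_block;

/* Hypothetical pool: a linked list of every block handed out so far. */
typedef struct sk_pool {
    sk_block *blocks;   /* live allocations, newest first */
    size_t count;       /* number of live allocations */
} sk_pool;

void sk_pool_init(sk_pool *p)
{
    assert(p != NULL);
    p->blocks = NULL;
    p->count = 0;
}

/* Allocate size bytes that stay valid until the pool is cleared.
 * Returns NULL if the underlying malloc fails. */
void *sk_palloc(sk_pool *p, size_t size)
{
    assert(p != NULL);
    sk_block *b = malloc(sizeof(sk_block) + size);
    if (b == NULL)
        return NULL;
    b->next = p->blocks;
    p->blocks = b;
    p->count++;
    return b + 1;       /* usable memory starts right after the header */
}

/* Release everything the pool handed out, in one sweep. */
void sk_pool_clear(sk_pool *p)
{
    assert(p != NULL);
    while (p->blocks != NULL) {
        sk_block *next = p->blocks->next;
        free(p->blocks);
        p->blocks = next;
    }
    p->count = 0;
}
```

A handler written against such a pool never calls free itself; whoever owns the pool clears it once the request has been processed.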

The Apache core component also contains a number of utility functions that are not of general interest outside the component.

2.9. Logging sub-component (`http_log.c`)

This sub-component exports functions that perform logging operations. It ensures that logs are properly accessed (one message at a time) and that they are properly closed. It also defines a priority based logging system. A user of the sub-component can specify the importance of the message it wants to be logged, ranging from debug-level messages (lowest priority) to "system is unusable" messages (highest priority). The latter means that the system or process has become highly unstable and the message should be logged immediately (e.g. before a crash happens).
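A priority based filter of this kind might look roughly like the following. The sketch assumes syslog-style ordering (a smaller number is more urgent); the sk_ names, the chosen subset of levels, and the output format are illustrative and are not the interface exported by http_log.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical levels, ordered as in syslog: smaller means more urgent. */
typedef enum {
    SK_LOG_EMERG = 0,   /* "system is unusable" */
    SK_LOG_ERR   = 3,
    SK_LOG_INFO  = 6,
    SK_LOG_DEBUG = 7    /* least important */
} sk_level;

typedef struct sk_logger {
    FILE *out;            /* where messages are written */
    sk_level threshold;   /* least urgent level that is still written */
    int written;          /* number of messages written so far */
} sk_logger;

/* Write msg if it is at least as urgent as the logger's threshold.
 * Returns 1 when the message was written, 0 when it was filtered out. */
int sk_log(sk_logger *lg, sk_level level, const char *msg)
{
    assert(lg != NULL && lg->out != NULL && msg != NULL);
    if (level > lg->threshold)
        return 0;
    fprintf(lg->out, "[%d] %s\n", (int)level, msg);
    if (level == SK_LOG_EMERG)
        fflush(lg->out);  /* push urgent messages out before a possible crash */
    lg->written++;
    return 1;
}
```

With the threshold set to SK_LOG_ERR, debug and informational messages are dropped, while error and emergency messages reach the log.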

2.10. Virtual hosts handling component (`http_vhost.c`)

The server can respond to more than one name (e.g. www.example and www2.example), each assigned to one of the multiple IP addresses of the machine. The multiple IP addresses can be associated with physical network interfaces or with virtual network interfaces (simulated via logical devices by the operating system). Apache is able to "tell" which name the host has been referenced under and use different configuration options (e.g. allowing more access rights to users accessing the host through an interface on the local network, as opposed to users accessing the web server via an interface on the outside-the-company network). Modules also have access to this information.

3. Concrete description of an Apache module

The role of modules is to implement/override/extend the functionality of the Apache web server. All modules have the same interface to the Apache core, and they do not interact directly with each other. In fact they are not active entities, they just offer a set of handlers for the distinct phases of handling HTTP requests. These phases are "controlled" by the http_request.c component of the Apache core.

3.1. How to write a module?

Figure 4 gives a detailed overview of the architecture of a module. A module is a collection of exported functions (handlers) which are called by the Apache core. The glue between one module and the core is the module structure. Such a structure is defined by each module and is accessible by the core. The structure holds pointers to all handlers exported by the module.

First a module must be initialized (when it is loaded). As the core has no information on the internal structure of the module, the module must export a handler for this purpose.

The idea behind Apache is to be as versatile as possible, therefore each module can define its own commands to be used in configuration files. However, the core is the one that actually reads the configuration files, so there must be a way to match a command with a module, or more precisely with the handler exported by the module to execute that command. This matching is done through the command table stored in the module structure.

The purpose of a module, as has been said, is to add new functionality to the way the web server services HTTP requests. However, the flow of executing a request is controlled by the core (http_request.c). A module is not required to export handlers for all the distinct phases of the HTTP request cycle. Figure 4 shows a module that implements handlers for all of the phases, but in fact most modules will implement only one or a few of them.

A special case of handler is the one for the phase that actually delivers the object to the client. Those handlers are called content handlers or response handlers. They are special because a module often implements several content handlers, one for each type of object the module knows how to deliver (e.g. a module might know how to deliver disk directories, but might also know how to deliver lists of people formatted in the same way as directories). In order to discover which content handler must be called for the current request the core uses the content handler table in the module structure. The table consists of pairs of a content type (i.e. MIME type) and a pointer to the handler. Of course a module can export the same content handler for more than one MIME type.
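A table of (content type, handler) pairs and its lookup could be sketched as below. The sk_ names and the three MIME type strings are made up for the example; the sketch only illustrates the pairing described above, including one handler registered for two types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down request record. */
typedef struct sk_request {
    const char *content_type;
} sk_request;

typedef int (*sk_content_fn)(sk_request *r);

/* One table entry: a content type and the handler that delivers it. */
typedef struct sk_handler_entry {
    const char *content_type;   /* MIME type or pseudo MIME type */
    sk_content_fn handler;
} sk_handler_entry;

/* Example handlers; the return values only identify which one ran. */
int sk_send_directory(sk_request *r) { (void)r; return 1; }
int sk_send_people(sk_request *r)    { (void)r; return 2; }

/* The table ends with a NULL entry. The same handler is registered for
 * two different (invented) content types. */
const sk_handler_entry sk_handlers[] = {
    { "application/x-directory",   sk_send_directory },
    { "application/x-people-list", sk_send_people },
    { "application/x-staff-list",  sk_send_people },
    { NULL, NULL }
};

/* Return the handler registered for type, or NULL if there is none. */
sk_content_fn sk_find_handler(const sk_handler_entry *table, const char *type)
{
    assert(table != NULL && type != NULL);
    for (; table->content_type != NULL; table++) {
        if (strcmp(table->content_type, type) == 0)
            return table->handler;
    }
    return NULL;
}
```

A NULL result means the module has nothing to offer for that type, so the next module's table would be consulted.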

An important element to be taken into account when writing a module is that, although handlers are invoked implicitly (or might not be invoked at all if handlers of other modules, for the same phase, inform the core that they have completed that phase), different handlers can communicate through data structures private to the module (usually static ones).

Figure 4. The detailed concrete architecture of a complex Apache module.

To summarize, writing an Apache module means providing functions (i.e. handlers), with a fixed prototype, that implement routines for initialization, configuration, and handling some of the phases of the HTTP request.

The simplest module is one that does not provide (i.e. does not need) any initialization or configuration handlers, does not define any custom configuration commands and implements only one handler (usually the content handler). A complex module (like the one in Figure 4) will implement all or most of the possible handlers.

4 Description of the Apache distribution modules

Apache comes with a set of modules grouped in the modules directory. The standard modules, grouped in the standard subdirectory, are essential for the functioning of Apache. An extension of Apache (in the form of a module) enables it to work as a web proxy. Additionally, the source contains a demo module that provides every handler a module can implement. It is well commented and is located in the example/ subdirectory.

4.1 Standard modules

In order to be fully functional, Apache installs by default a number of standard modules. As standard modules they are not dynamically loaded, although they could be. Modules can be statically linked with the core or can be compiled as separate dynamically loaded pieces. The statically linked modules are always pre-loaded at start-up. Dynamically loaded modules can also be pre-loaded.

When the configuration scripts are run, a special file, modules.c, is automatically generated in the root of the source code. modules.c defines two arrays of pointers to module structures: ap_prelinked_modules[], for the modules that are linked with the core, and ap_preloaded_modules[], for those that will be preloaded.

It is interesting that even mod_core, which is the module structure defined by the http_core.c sub-component, is listed in the arrays generated in modules.c. The following list gives an overview of some of the standard modules. mod_core (the pseudo module defined in the Apache core, sub-component http_core.c) is also included.

For URI to file name translation phase:
- mod_userdir: translates user home directories into actual paths
- mod_rewrite (Apache 1.2 and up): rewrites URLs based on regular expressions; it has additional handlers for fix-ups and for determining the MIME type
For authentication / authorization phases:
- mod_auth, mod_auth_anon, mod_auth_db, mod_auth_dbm: user authentication using, respectively, text files, FTP-style anonymous access, Berkeley DB files, and DBM files.
- mod_access: host based access control.
For determining the MIME type of the requested object (the content type, the encoding and the language):
- mod_mime: determines document types using file extensions.
- mod_mime_magic: determines document types using "magic numbers" (e.g. all gif files start with a certain code)
For fix-ups phase:
- mod_alias: replace aliases by the actual path
- mod_env: fix-up the environment (based on information in configuration files)
- mod_speling: automatically correct minor typos in URLs
For sending actual data back to the client (to choose the appropriate module for this phase, the MIME type or the pseudo MIME type, e.g. for a CGI script, is used):
- mod_actions: file type/method-based script execution
- mod_asis: send the file as it is
- mod_autoindex: sends an automatically generated representation of a directory listing
- mod_cgi: invokes CGI scripts and returns the result
- mod_include: handles server side includes (documents parsed by the server, which includes certain additional data before handing the document to the client)
- mod_dir: basic directory handling.
- mod_imap: handles image-map files
For logging the request phase:
- mod_log_*: various types of logging modules

A summary of what each standard module has to offer (i.e. which phases it handles) is given in the following table. Note that a module can define handlers for more than one phase. Again, mod_core has been included.

No.	Phase	Modules
2	filename translation	mod_alias.c mod_userdir.c mod_core
3	check_user_id	mod_auth.c mod_auth_anon.c mod_auth_db.c mod_auth_dbm.c mod_digest.c
4	check auth	mod_auth.c mod_auth_anon.c mod_auth_db.c mod_auth_dbm.c mod_digest.c
5	check access	mod_access
6	type_checker	mod_mime.c mod_mime_magic.c mod_negotiation.c mod_core
7	fixups	mod_alias.c mod_cern_meta.c mod_env.c mod_expires.c mod_headers.c mod_negotiation.c mod_speling.c mod_usertrack.c mod_core
8	content handlers	mod_actions.c mod_asis.c mod_autoindex.c mod_cgi.c mod_dir.c mod_imap.c mod_include.c mod_info.c mod_negotiation.c mod_status.c mod_core
9	logger	mod_log_agent.c mod_log_config.c mod_log_referer.c

As can be seen from the above table, the functionality of mod_core is indeed minimal. One can also observe that a module tends to provide handlers for related phases (e.g. authorization and authentication, or MIME type check and content handlers), although this type of behavior is not a requirement.

4.2 The proxy module

The behavior of a web proxy is very similar to that of a web server, at least as far as the client is concerned (i.e. it follows the same protocol). Apache has therefore been easily extended to act as a proxy by means of a module. Being a more complex module, it has been implemented as a set of sub-components:

mod_proxy.h and mod_proxy.c are in a sense the main files of this module, since they define the module structure. The handlers implemented by this module are:
- URI to filename translation: this phase is necessary because a request might refer to files that are not on the local disk but on different machines.
- fix-up handler: this only puts the requested URL in a canonical form (so it can be used as a search key in the cache).
- content handler: this is implemented because the proxy does not simply deliver files from the local disk; it might contact a web server in order to obtain the file and it might save the file into a cache.
proxy_cache.c manages the cache implemented by the proxy module. The cache is an example of a private data structure that survives between implicit invocations of the handlers of the proxy module.
proxy_connect.c implements the code that connects this server to a web server. The proxy acts as a client for the web server and as a server for the HTTP client.
proxy_ftp.c and proxy_http.c implement utility routines specific to the FTP and HTTP protocols.
proxy_util.c implements various routines that mainly deal with matching symbolic host names, host Internet addresses, etc.

5. Concurrency

Apache provides access to two levels of concurrency. Multiple server processes are forked at startup. The default number of these processes is five, and this is the minimum number of idle servers that Apache tries to maintain at all times. As well, if multi-threading is supported by the operating system, a default of up to 50 threads is allowed for each process.

Each request that the server receives is actually handled by a copy of the httpd program. Rather than creating a new copy when it is needed, and killing it when a request is finished, Apache maintains at least 5 and at most 10 inactive children at any given time. The parent process runs a periodic check on a structure called the scoreboard, which keeps track of all existing server processes and their status. If the scoreboard lists fewer than the minimum number of idle servers, then the parent will spawn more. If the scoreboard lists more than the maximum number of idle servers, which is by default 10, then the parent will proceed to kill off the extra children.

When it receives a request, the parent process passes it along to the next idle child on the scoreboard. Then the parent goes back to listening for the next request.

Figure 5. The parent passes a request to a child to be served

There is also a default limit of 256 on the total number of servers that can exist at one time. The authors of Apache provided this upper bound in order to keep the machine that the software is running on from being swamped by servers and crashing. The default was picked to keep the scoreboard file small enough so that it can be scanned by the processes without causing overhead concerns.

Since the number of requests that can be processed at any one time is limited by the number of processes that can exist, there is a queue provided for waiting requests. The maximum number of pending requests that can sit on the queue is 511.

5.1. Keep-alive or Persistent Connections

Apache uses the persistent connection to allow multiple requests from a client to be handled by one connection, rather than opening and closing a connection for each request. The default maximum number of requests allowed over one connection is 100. The connection is closed by a timeout.

6. Data Structures

This section describes some data structures that are central to the functioning of the Apache server.

6.1. request_rec

Once a request has been read in, http_request.c is the code which handles the main line of request processing. It finds the right per-directory configuration, building it if necessary. It then calls all the module dispatch functions in the right order.

Figure 6. Calls to modules for the request processing cycle

When the module handlers are called, the only argument passed to them is request_rec. The pieces of this structure which are public to the modules allow them to learn what the request is and how it should be handled. Most of the handlers complete their part of the request cycle by changing some fields in the request_rec. But the response handlers must actually return something to the client. Sometimes these handlers need to direct a server to return some other file instead of the one that the client originally requested. This is a redirected request.

Some handlers can farm out part of their job to another process in the form of a sub-request.

Request_rec can be a linked list if the request is redirected by a handler. The structure can contain pointers to the request_rec the request is redirected to and the request_rec it is redirected from. Or, if it is a sub-request, it can contain a pointer to the original request_rec.

6.2. Internal Redirected Requests

The response handler may find that the request to be served is better handled as another type of request. If the request is to an imagemap, a type map, or a CGI script, then the actual resource the user requested is in some other URI than the one originally used. In this case, the module's handler generates a new request and passes it to another process.

The handler invokes ap_internal_redirect, which initiates a new request_rec. The chain of redirects is kept as a list of request_recs linked by pointers. The result of the final response handler is passed back up the chain to the one that caught the original request, and is then sent back to the client.
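The linked chain of redirected requests can be pictured with a reduced record that keeps only the two link pointers. The sk_ names and the helper functions are assumptions made for this sketch; it does not reproduce ap_internal_redirect itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down request record with the two links used
 * for internal redirects. */
typedef struct sk_request {
    const char *uri;
    struct sk_request *prev;   /* request this one was redirected from */
    struct sk_request *next;   /* request this one was redirected to */
} sk_request;

/* Link target behind r, as an internal redirect would do. */
void sk_redirect(sk_request *r, sk_request *target)
{
    assert(r != NULL && target != NULL);
    r->next = target;
    target->prev = r;
}

/* Follow the chain forward to the request that produces the response. */
sk_request *sk_final(sk_request *r)
{
    assert(r != NULL);
    while (r->next != NULL)
        r = r->next;
    return r;
}

/* Follow the chain backward to the request the client actually sent. */
sk_request *sk_initial(sk_request *r)
{
    assert(r != NULL);
    while (r->prev != NULL)
        r = r->prev;
    return r;
}
```

Walking backward from the last record recovers the request the client originally made, which is where the final result has to be delivered.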

6.3. Sub-requests

The sub-request mechanism allows a response handler to look up files and URIs without actually sending a response. This is done using the functions ap_sub_req_lookup_file or ap_sub_req_lookup_uri. These construct a new request_rec which is processed up to the point of the response.

6.4. Request_rec

From http_request.h: the data structure request_rec is the central tie that binds the system together. It contains all the information that each module needs in order to handle the request.

6.4.1. Structure of request_rec

Here is a partial list of the fields that are contained in a request_rec.

pointers to other request_rec, as described above
pointer to resource pool
object being requested
- URI
- filename
- path
- status of information - e.g. set to zero if no such file
document content information
- content type
- encoding
MIME header tables - in, out, and error headers
information about the request
- protocol to use
- method - e.g. GET, HEAD, POST
information for logging
- bytes sent
- request description

6.5. compool

Rather than keeping track of which files are opened and where memory has been allocated, and then explicitly tracking it all down to deallocate it, Apache uses the idea of resource pools. A resource pool is a data structure which keeps track of all allocations of finite resources that are associated with a request. When the request cycle is finished, all the resources held in the pool are released at one time.

This provides the advantages of garbage collection without the extensive code, and small amounts of space can be allocated without adding large amounts of record keeping.

One disadvantage of this method is that resources that are not being actively used cannot be released until the pool is cleared. This can create problems, especially with memory. So the modules can establish private resource pools that they can clear or destroy as they want.

6.6. Command Tables

The core's command table is held in http_core.c (Thau, pg 3).

Each module may have its own command table, which allows it to handle commands read from configuration files. The entries for each command listed in the table are:

the name of the command
a pointer to the command handler - This is a C function which processes the command.
an argument which is passed to the command handler - A command handler may process many commands.
where the command may appear
the type and number of arguments the command takes
a description of the arguments that the command takes (Thau, pg 3)
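
A table with these six entries, and a lookup over it, might look like the sketch below. The structure, the context flags, the dispatch function and the two example rows are all assumptions made for illustration; they do not reproduce Apache's actual declarations, and the sketch matches names case-sensitively for simplicity.

```c
#include <stddef.h>
#include <string.h>

/* A command handler returns NULL on success or an error message. */
typedef const char *(*sketch_cmd_fn)(void *cmd_data, const char *arg);

/* Hypothetical flags for where a command may appear. */
enum { SKETCH_SERVER_CONF = 1, SKETCH_DIR_CONF = 2 };

typedef struct {
    const char *name;     /* the name of the command */
    sketch_cmd_fn func;   /* pointer to the command handler */
    void *cmd_data;       /* argument passed to the command handler */
    int where;            /* where the command may appear (bit mask) */
    int num_args;         /* number of arguments the command takes */
    const char *usage;    /* description of the arguments */
} sketch_command;

/* Settings written by the handler (illustrative only). */
const char *sketch_doc_root = NULL;
const char *sketch_admin = NULL;

/* One handler serves several commands; cmd_data selects the target setting. */
const char *sketch_set_string(void *cmd_data, const char *arg)
{
    *(const char **)cmd_data = arg;
    return NULL;
}

const sketch_command sketch_cmds[] = {
    {"DocumentRoot", sketch_set_string, &sketch_doc_root,
     SKETCH_SERVER_CONF, 1, "a directory path"},
    {"ServerAdmin", sketch_set_string, &sketch_admin,
     SKETCH_SERVER_CONF, 1, "an e-mail address"},
    {NULL, NULL, NULL, 0, 0, NULL}  /* end-of-table marker */
};

/* Find name in the table and run its handler if allowed in this context. */
const char *sketch_dispatch(const sketch_command *table, int context,
                            const char *name, const char *arg)
{
    for (; table->name != NULL; table++) {
        if (strcmp(table->name, name) == 0) {
            if ((table->where & context) == 0)
                return "command not allowed here";
            return table->func(table->cmd_data, arg);
        }
    }
    return "unknown command";
}
```

The cmd_data column is what allows a single handler function to process many commands, as noted in the list above.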

6.7. Scoreboard

The scoreboard structure is used to keep track of the child processes. The information is kept brief, basically just the status value and the pid, the process id number. The creators of Apache have plans to add a separate set of longer score structures that will give the number of requests serviced, and data on the current or most recent request.

Each time a parent process spawns a child, a record is created for the child in scoreboard. When a child is killed, its record is removed from scoreboard. The status value of a process is written to scoreboard by the process itself. The parent process uses the status value of each child to determine if new children need to be created, or if there are too many idle processes.

The status values defined in scoreboard.h are the following:

 SERVER_DEAD 0
 SERVER_STARTING 1       /* Server Starting up */
 SERVER_READY 2          /* Waiting for connection (or accept() lock) */
 SERVER_BUSY_READ 3      /* Reading a client request */
 SERVER_BUSY_WRITE 4     /* Processing a client request */
 SERVER_BUSY_KEEPALIVE 5 /* Waiting for more requests via keepalive */
 SERVER_BUSY_LOG 6       /* Logging the request */
 SERVER_BUSY_DNS 7       /* Looking up a hostname */
 SERVER_GRACEFUL 8       /* server is gracefully finishing request */
 SERVER_NUM_STATUS 9     /* number of status settings */
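
How a parent might use these status values to balance the number of idle children can be sketched as follows. The record layout, the function names and the min_spare and max_spare parameters are assumptions for illustration, not the code in http_main.c; only a subset of the status values is repeated.

```c
#include <stddef.h>

/* Subset of the status values listed above. */
enum {
    SERVER_DEAD = 0,
    SERVER_STARTING = 1,
    SERVER_READY = 2,
    SERVER_BUSY_READ = 3
};

/* One brief record per child: just the pid and the status value. */
typedef struct {
    int pid;
    int status;
} sketch_score;

/* Count the children that are waiting for a connection. */
size_t sketch_count_idle(const sketch_score *board, size_t n)
{
    size_t i, idle = 0;
    for (i = 0; i < n; i++)
        if (board[i].status == SERVER_READY)
            idle++;
    return idle;
}

/*
 * Positive result: number of children to spawn.
 * Negative result: number of surplus idle children.
 * Zero: the idle count is within [min_spare, max_spare].
 */
int sketch_spare_adjustment(const sketch_score *board, size_t n,
                            int min_spare, int max_spare)
{
    int idle = (int)sketch_count_idle(board, n);
    if (idle < min_spare)
        return min_spare - idle;
    if (idle > max_spare)
        return max_spare - idle;
    return 0;
}
```

Because each child writes only its own status and the parent only reads, the decision needs nothing beyond a scan over the records.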

6.8. Module Structure

The specific contents of a module are determined by the type of function the module performs. Figure 4 shows a generalized picture of a module.
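
In code, a module of this kind can be pictured as a record of function pointers, one slot per request-processing phase, which the core walks without knowing any module by name. The sketch below works under that assumption only: the phase names, return codes, module records and file paths are hypothetical and far smaller than a real Apache module.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes: handled, or "let another module try". */
enum { SKETCH_OK = 0, SKETCH_DECLINED = -1 };

/* Two illustrative phases of request processing. */
enum { PHASE_TRANSLATE = 0, PHASE_TYPE_CHECK = 1, PHASE_COUNT = 2 };

typedef struct {
    const char *uri;
    const char *filename;
    const char *content_type;
} sketch_request;

typedef int (*sketch_handler)(sketch_request *r);

/* A module is a name plus one optional handler per phase. */
typedef struct {
    const char *name;
    sketch_handler handlers[PHASE_COUNT];
} sketch_module;

/* Maps URIs under /icons/ to a placeholder location. */
int sketch_icons_translate(sketch_request *r)
{
    if (strncmp(r->uri, "/icons/", 7) != 0)
        return SKETCH_DECLINED;
    r->filename = "/srv/sketch/icons";
    return SKETCH_OK;
}

/* Fallback translation: use the URI itself as the file name. */
int sketch_default_translate(sketch_request *r)
{
    r->filename = r->uri;
    return SKETCH_OK;
}

/* Assigns a content type to URIs ending in ".html". */
int sketch_html_type(sketch_request *r)
{
    size_t n = strlen(r->uri);
    if (n < 5 || strcmp(r->uri + n - 5, ".html") != 0)
        return SKETCH_DECLINED;
    r->content_type = "text/html";
    return SKETCH_OK;
}

sketch_module sketch_mod_icons = {"icons", {sketch_icons_translate, NULL}};
sketch_module sketch_mod_mime = {"mime", {NULL, sketch_html_type}};
sketch_module sketch_mod_default = {"default", {sketch_default_translate, NULL}};

sketch_module *sketch_modules[] = {
    &sketch_mod_icons, &sketch_mod_mime, &sketch_mod_default, NULL
};

/* Run one phase: the first module that does not decline decides the result. */
int sketch_run_phase(sketch_module **mods, int phase, sketch_request *r)
{
    for (; *mods != NULL; mods++) {
        sketch_handler h = (*mods)->handlers[phase];
        int rc;
        if (h == NULL)
            continue;  /* this module takes no part in the phase */
        rc = h(r);
        if (rc != SKETCH_DECLINED)
            return rc;
    }
    return SKETCH_DECLINED;
}
```

Every call in sketch_run_phase goes through a pointer stored in data, so no static call from the core to a named module function appears in the source text.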

7. Fact extraction and visualization using PBS

The source code has been analyzed with the Portable Bookshelf (PBS) tools. There were a number of incompatibilities between the fact extractor and the C source of Apache, which were resolved by editing the source files in such a way as not to modify the facts. Beyond that, the extracted facts were not very useful in the analysis of the source code.

One of the main difficulties lies in the characteristics of the Apache source code, which defines a large number of macros, not only for data structures but also for procedures, their parameters and their return types. In many cases this misled PBS into showing the .h files as suppliers/users when the actual suppliers/users were, in fact, the .c files (through macros).

As an example Figure 7 shows the result for the Apache core structure. The content of the sub-components is hidden in order to increase clarity. The arrows that point down (at the bottom of the picture), go to the utilities component described earlier (ap, regex, os).

The following table shows what has been grouped under each sub-system (file.* means both file.c and file.h, prefix*.h means all files with that prefix).

Sub-component	Files
http_main	`http_main.*, httpd.h`
protocol	`http_protocol.*, rfc1413.*`
http_request	`http_request.*`
http_core	`http_core.c, http_core.h`
http_config	`http_config.*, http_conf_globals.h, ap_*.h`
resources	`buff.*, alloc.*`
util	`util*.*`
http_log	`http_log.*`
http_vhosts	`http_vhost.*`

Another major difficulty in using PBS, or any other automatic fact extractor, is that there is no way to extract relations between the Apache core and the modules. All the calls and references are made through pointers to functions or data structures. These kinds of interactions are difficult, if not impossible, to extract at compile time. Even most of the interactions between http_core.c and the rest of the sub-components of the Apache core go through the same mechanism (see the section on http_core.c).

Figure 7. Facts extracted using the Portable Bookshelf Tools

The containment relation used in the generation of the above figure is based on Figure 3, which has been constructed using information from documentation, source code comments and read-me files.

It should be noted that the suppliers and users of http_config have not been drawn. That is because nearly everybody references this module since many .h files have been included in it. The above figure is more a validation of the concrete architecture depicted in Figure 3, rather than a source for it.

8. Conclusions

This report has offered a tour of the concrete architecture of the Apache web server (release 1.3.4). The modular architecture of the server seems to offer great opportunity for extending the code. Designers of Apache strove to move as much of the functionality as possible into the modules. Therefore, modules must implement a well defined API.

Communication between the core and the external modules is done through the modules' handler functions. The module handlers are invoked to perform certain phases of processing a request. Handlers receive a reference to the request_rec, which contains the information about the request and the resources the handlers need.

We did not observe the same independence and well defined API between the components of the core. Of course there are some clear utility components that offer services to the other components of the core and to the modules, but the important parts of the core are tightly linked together. One example of this inter-dependence is the relationship among http_request.c, which controls the flow of processing a request; http_config.c, which performs the actual invocation of handlers; and http_protocol.c, which communicates with the HTTP client. This linking of the core components makes it somewhat more difficult to change the behavior of Apache by modifying the core. Fortunately, modules can do the same jobs as well, if not better, and they are usually easier to write.

Since the method of calls to its handlers is transparent to the module and all communication with a module is done through pointers to functions, we have found that fact extractors do not capture the interaction between core and modules.

9. Dictionary of terms

API

Application Programming Interface

component

term used throughout this report in place of the term module, which is reserved for Apache modules. This distinction is not standard terminology; its only purpose is to avoid confusion.

core (Apache)

part of the Apache server that defines and manages the steps in answering the request and implements the HTTP protocol.

CGI (script)

Common Gateway Interface, an interface describing how a web server passes parameters to, and receives results from, another process on the same machine, called a CGI script (executed by the web server when it receives a request referencing the script).

FTP

File Transfer Protocol, a protocol that coordinates how binary and ASCII files are transferred over the Internet.

handler

a function of a module that will be implicitly invoked by the core to handle the phase of processing the HTTP request for which the handler was designed.

HTTP

Hypertext Transfer Protocol, the protocol that coordinates how hypertext files are transferred over the Internet. However, any kind of file can be transferred via HTTP.

httpd

the usual name for the web server (stands for HTTP daemon).

IPC (mechanisms)

inter process communication mechanisms (e.g. queues, semaphores, shared memory)

MIME type

MIME stands for Multipurpose Internet Mail Extensions. MIME types are the types (e.g. gif, html) of the entities defined in the MIME Requests for Comments (RFCs).

module (Apache module)

part of the Apache server that provides some functionality in one or more phases of servicing an HTTP request. Its functions (handlers) are implicitly invoked by the Apache core. It is interfaced with the Apache core by a special API.

NCSA web server

the web server provided and maintained by the Development Group of the National Center for Supercomputing Applications, at the University of Illinois at Urbana-Champaign

proxy (web)

a server that is contacted by HTTP clients to fetch web pages on their behalf from the actual web servers. It caches web pages in order to serve subsequent requests for the cached pages.

request (HTTP client)

a message from the client containing information about the resource requested and how it should be delivered.

resource (HTTP)

a network data object or service which can be identified by a URI

response (HTTP server)

the response from the web server to an HTTP request. It contains a header and usually the actual resource. The header contains status information and information on the resource (e.g. type, length of the binary representation).

resource pool

A large data structure allocated in one step by the Apache core, which holds the resources (memory blocks, open files) associated with a given request. When the resource pool is no longer needed it is deallocated in one step (memory is freed and files are closed).

URI / URL

Uniform Resource Identifier / Uniform Resource Locator

TCP/IP

Transmission Control Protocol / Internet Protocol, the protocol suite used on the Internet at the transport level (TCP) and the network level (IP).

virtual host

a single physical host might have more than one network interface, each with a different IP address and a different host name. To clients it appears as a number of virtual hosts, one for each name.

10. References

[Thau96]

Design considerations for the Apache Server API, Robert Thau, Fifth International World Wide Web Conference, 1996, Paris.

[APINotes]

Apache API notes , Robert S. Thau.

[ApacheDocs]

Apache server documentation

[Preston99]

Conceptual Architecture of the Apache Web Server, Jean Preston, assignment for CS746G, Feb 2 1999, Dept. of Computer Science, University of Waterloo (http://www.grad.math.uwaterloo.ca/~je2preston/)

[Dragoi99]

Conceptual Architecture of the Apache Web Server, O. Andrei Dragoi, assignment for CS746G, Feb 2 1999, Dept. of Computer Science, University of Waterloo (http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a1/apache_conceptual_arch.html)
