http://www.cs.ucsb.edu/~tve/cs290i-sp01/papers/Concept_Apache_Arch.htm
The Conceptual Architecture of the Apache Web Server

Octavian Andrei Dragoi,
Department of Computer Science,
University of Waterloo,
oadragoi@neumann.uwaterloo.ca

Assignment 1
for CS746G
January 26, 1999
 

Abstract:

This report presents the conceptual (abstract) architecture of the Apache web server. It tries to emphasize the overall structure of the system, without going into implementation details, or requiring such details to be previously known by the reader. The main purpose is to make the architecture "intellectually tractable" ([Monroe97]).
The conceptual architecture has been inferred from a number of Apache related documents and from the way source files are grouped and named.
At a high level the Apache server architecture is composed of a core that implements the most basic functionality of a web server and a set of standard modules that actually service the phases of handling an HTTP request.
The server core accepts an HTTP request and implicitly invokes the appropriate handlers, sequentially and in the appropriate order, to service the request.
The report shows that the architectural style (in the sense of [Garlan94]) that most closely characterizes the Apache architecture is "implicit invocation", although the notion of event does not exist in the architecture.
The architecture offers great opportunities for extending or changing the Apache functionality by adding or replacing modules.
Keywords:
Apache, conceptual architecture, abstract architecture, web server
Available online at:
http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a1/apache_conceptual_arch.html
 

1. Introduction

The goal of this report is to present the conceptual (abstract) architecture of the Apache web server. It therefore leaves aside implementation details and tries to be simple. As stated in [Monroe97], a good architectural description makes the architecture "intellectually tractable". The paper might sometimes simplify the actual architecture in order to achieve this goal.

The report assumes no previous familiarity with the architecture of the Apache web server. So it can serve as an introductory reading on the architecture of the server.

It should be noted that the architecture described here might not be entirely accurate. It has been inferred from several sources, including the overall structure of the source files and their names. It does not start from a previously existing complete design document.

1.1. About Apache

The Apache web server is currently the most popular web server, according to a NetCraft Survey. It has maintained (and improved) this status since mid 1996. Originally, the project was based on NCSA httpd 1.3, hence the name ("A PAtCHy Server"). Since then the code base has been completely rewritten and has evolved into a completely independent project.
One of the major reasons for Apache's success is the fact that it is an "open source" project: anyone can have access to the Apache source code, and anyone can write their own modules to suit their needs (source: Apache FAQ).

This may be the place to mention that Apache is written to be drop-in compatible with the NCSA server. This has design consequences related to some configuration commands promoted by the NCSA server, which cannot be naturally implemented in Apache. These commands are supported in a way that is somewhat outside the general "philosophy" of the system ([Thau96]). More details are given in the configuration section.

1.2. Overview

The report is organized as follows. Section 2 offers a high level view of the conceptual architecture of Apache, outlining the main building blocks: the Apache core and the Apache modules. Section 3 gives details on the conceptual architecture of the Apache core and shows the high level anatomy of a module. It also outlines the phases of handling an HTTP request as divided by the Apache architecture, and ends with a short description of the most representative standard modules. Section 4 gives the conceptual architecture of the Apache server and analyzes the concurrency in the system. Section 5 presents some additional issues related to the architecture of the system, mainly how configuration fits into the whole picture, how data is passed between the core and the modules, and how resources are allocated and managed. Section 6 comments on the architectural styles (in the sense of [Garlan94], [Shaw96]) applicable to the Apache architecture, while Section 7 elaborates on extensibility issues. Conclusions and a dictionary of terms end the report.

2. High level Conceptual Architecture

The function of a web server is to service requests made through the HTTP protocol. Typically the server receives a request asking for a specific resource and returns the resource as a response. A client might reference a file in its request, in which case that file is returned, or a directory, in which case the content of that directory (codified in some suitable form) is returned. A client might also request a program, and it is the web server's task to launch that program (a CGI script) and to return its output to the client. Various other resources might be referenced in a client's request.
To summarize: the web server takes a request, decodes it, obtains the resource and hands it to the client.

Additional concerns, related to controlling access and authorizing clients, are also the responsibility of the web server. As has been said, the web server might execute programs in response to client requests. It must ensure that this is not a threat to the host system (where the web server runs). In addition, the web server must be capable not only of responding to a high rate of requests, but also of satisfying each request as quickly as possible.

2.1. Description

As opposed to a monolithic server architecture in which all the activities are done by a single unit (in which different parts of handling a request are poorly delimited), Apache takes a modular approach. Figure 1 illustrates the high level conceptual architecture. There is a core part of the server that is responsible for defining and following the steps in servicing a request and several modules that actually implement the different phases of handling the request.
As shall be seen later Figure 1 does not capture an important characteristic of the architecture, namely, the predefined order in which modules are called, based on their advertised characteristics.

Figure 1. High level Conceptual Architecture
The idea is to keep the basic server code clean while allowing third-parties to override or extend even basic characteristics.

3. The core and the modules

This section presents in more detail the components of the Apache server architecture. It presents the conceptual parts of the Apache core and how a request is decomposed into a set of phases. It also describes the anatomy of an Apache module (at a conceptual level).

3.1. The core

The core implements the basic functionality of the server. In addition it implements a number of utility functions. One utility worth mentioning is the one that provides resource allocation from a per-request pool. This facility is offered not only to the server core but also to modules.

The following are the components of the core:

  • http_protocol.c: contains routines that communicate directly with the client (through the socket connection), following the HTTP protocol. All data transfers to the client are done using this component.
  • http_main.c: the component that starts up the server and contains the main server loop that waits for and accepts connections. It is also in charge of managing timeouts.
  • http_request.c: the component that handles the flow of the request processing, dispatching control to the modules in the appropriate order. It is also in charge of error handling.
  • http_core.c: the component implementing the most basic functionality, which is described in a comment from a source file as being "just 'barely' functional enough to serve documents, though not terribly well". Another quote from a source file comment illustrates the function of this component very well: "this file could almost be mod_core.c". This means that the component behaves like a module but has to access some globals directly (which is not characteristic of a module).
  • alloc.c: the component that takes care of allocating resource pools and keeping track of them.
  • other utilities, including reading configuration files and managing the information gathered from those files (http_config.c), as well as support for virtual hosts. An important function of http_config is that it forms the list of modules that will be called to service the different phases of a request.
In the above list the term component has been used in order to avoid the term module, which will be used only to refer to Apache modules.
Figure 2. Architecture for Apache core
Figure 2 depicts the interaction between the different components of the core. As all components use the various utility functions, connectors to UTILITIES and ALLOC have not been pictured. Interaction is used in a broad sense, ranging from calling a service function of a component to "conceptually" relinquishing control to that component.

It is interesting to observe that although the components of the core have rather distinct functionality, there is no simple way to depict the interactions between them. Most of the architectural information is in the names of the components rather than in the connectors between them.

This is due to the considerable effort made by the designers to move everything that can be expressed as a separate entity into the modules part of the Apache server. What is left in the core are components too interconnected to be written as separate modules.

3.2. Request Phases

A module implements only a portion of the functionality for servicing a client request. More than one module is necessary to completely respond to a request. However, modules do not know about one another. Control is transferred back and forth between the core and the different modules. This is handled by dividing the handling of the request into a set of distinct phases.

The following are the phases of handling a request for the Apache server:

  • URI to filename translation;
  • Check access based on host address, and other available information;
  • Get a user id from the HTTP request and validate it;
  • Authorize the user;
  • Determine the MIME type of the requested object (the content type, the encoding and the language);
  • Fix-ups (for example replace aliases by the actual path);
  • Send the actual data back to the client;
  • Log the request;
The phases are "controlled" by the http_request component of the core as has been already stated (see Figure 2.).
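The fixed order of the phases can be illustrated with a minimal sketch. It is written in Python rather than in Apache's C, and the names PHASES, Module and serve are invented for the illustration; they are not part of the Apache API. The sketch assumes the simplest possible rule, namely that every handler registered for a phase is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Phase names follow the list above; the identifiers are hypothetical.
PHASES = [
    "translate_uri",
    "check_access",
    "check_user_id",
    "authorize",
    "determine_type",
    "fixups",
    "send_response",
    "log_request",
]


@dataclass
class Module:
    name: str
    # Maps a phase name to the handler this module offers for that phase.
    handlers: Dict[str, Callable[[dict], None]] = field(default_factory=dict)


def serve(request: dict, modules: List[Module]) -> List[str]:
    """Walk the phases in order and call every handler registered for each."""
    trace = []
    for phase in PHASES:
        for module in modules:
            handler = module.handlers.get(phase)
            if handler is not None:
                handler(request)
                trace.append(f"{phase}:{module.name}")
    return trace


def demo_request_phases() -> dict:
    """Serve one request with three toy modules and return the request."""
    def translate(req):
        req["filename"] = "/var/www" + req["uri"]

    def set_type(req):
        is_html = req["filename"].endswith(".html")
        req["content_type"] = "text/html" if is_html else "application/octet-stream"

    def send(req):
        req["body"] = f"<contents of {req['filename']}>"

    def log(req):
        req["logged"] = True

    modules = [
        Module("files", {"translate_uri": translate, "send_response": send}),
        Module("types", {"determine_type": set_type}),
        Module("logger", {"log_request": log}),
    ]
    request = {"uri": "/index.html"}
    request["trace"] = serve(request, modules)
    return request
```

Note that the "files" module offers handlers for two different phases, and that the modules never call each other: the order comes entirely from the loop in serve, which plays the role of the core.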

3.3. Modules

As has been said, the role of the modules is to implement, override or extend the functionality of the Apache web server. All modules have the same interface to the core of the server. Modules do not interact directly with one another. If they interact, it is always through the Apache core (implicit invocation, as shall be seen).

Figure 3. Architecture of an Apache Module
Apache (1.3) permits loading of modules when they are needed (they are dynamically linked with the server) and therefore the initialization and configuration methods might be called when the module is loaded as opposed to when the server is initialized.

3.4. Handlers

For Apache, a handler is the action that must be performed in some phase of servicing a request. For example, when the requested object is a file, the handler that returns the file must open the file, read its content and hand that content to the client (through the Apache core).

Handlers are defined by modules, and a module might specify handlers for one, many or none of the phases of a request. Handlers are the part of the module that is called when the processing of the request enters the phase for which the handler is defined.

The rationale behind having modules define handlers for more than one phase is that a module might internally save data on the request being processed, and when its handlers for a subsequent phase of the request are called they might make use of that data. In theory the module might even save data between different requests (e.g. it might cache some file content for future use).

It should be noted that there are additional functions exported by modules, related to configuration and initialization. They are called in the startup phase of the server.

3.5. Standard Modules

Apache comes with a set of standard modules for providing the complete functionality of a web server. The most representative/relevant among the standard modules are listed below. They also illustrate what kind of manipulation can be done at each phase.
  • For URI to file name translation phase:
    • mod_userdir: translates user home directories into actual paths
    • mod_rewrite (Apache 1.2 and up): rewrites URLs based on regular expressions; it has additional handlers for fix-ups and for determining the MIME type
  • For authentication / authorization phases:
    • mod_auth, mod_auth_anon, mod_auth_db, mod_auth_dbm: user authentication using, respectively, text files, FTP-style anonymous access, Berkeley DB files, and DBM files.
    • mod_access: host based access control.
  • For determining the MIME type of the requested object (the content type, the encoding and the language):
    • mod_mime: determines document types using file extensions.
    • mod_mime_magic: determines document types using "magic numbers" (e.g. all gif files start with a certain code)
  • For fix-ups phase:
    • mod_alias: replace aliases by the actual path
    • mod_env: fix-up the environment (based on information in configuration files)
    • mod_speling: automatically correct minor typos in URLs
  • For sending actual data back to the client: to choose the appropriate module for this phase the MIME type or the pseudo MIME type (e.g. for a CGI script) is used.
    • mod_actions: file type/method-based script execution
    • mod_asis: sends the file as it is
    • mod_autoindex: sends an automatically generated representation of a directory listing
    • mod_cgi: invokes CGI scripts and returns the result
    • mod_include: handles server side includes (documents parsed by the server, which includes certain additional data before handing the document to the client)
    • mod_dir: basic directory handling.
    • mod_imap: handles image-map files
  • For logging the request phase:
    • mod_log_*: various types of logging modules

4. Conceptual Architecture

Figure 1 has shown the main components of the Apache web server and how they interact. However it does not illustrate the fact that handlers in modules are called in a fixed, predefined order, which is the order of the phases of servicing a request. Figure 4 tries to add the flow information mentioned above.

For some phases only one module (handler in a module) can be called. Such phases are the authorization, the authentication, the return of the actual object to the client, and sometimes the URI to filename translation.
Other phases of servicing a request can have more than one handler called. For example, more than one module can be called to implement the logging part of the request.

In some phases of processing a request all the handlers (in the registered modules) might be called until one returns a special code meaning that subsequent registered handlers for the current phase should not be called. An example is the URI to filename translation phase.
Furthermore, a handler might return an error code. In that case the processing of the request should stop and an error should be returned to the client (i.e. no other handlers are called, from this phase or subsequent phases).
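The calling rules described above can be sketched as follows. This is an illustration in Python under stated assumptions, not Apache's C code: the constants OK and DECLINED and the functions run_phase and process are defined here only for the example, and any other return value is treated as an error code to be reported to the client.

```python
from typing import Callable, List, Tuple

# Return codes invented for this sketch.
OK = 0          # the handler dealt with the phase
DECLINED = -1   # the handler has nothing to do; try the next one

Handler = Callable[[dict], int]


def run_phase(handlers: List[Handler], request: dict, run_all: bool) -> int:
    """Call the handlers registered for one phase.

    run_all=False: stop at the first handler that does not decline.
    run_all=True:  call every handler unless one reports an error.
    Any code other than OK or DECLINED is treated as an error.
    """
    for handler in handlers:
        status = handler(request)
        if status == DECLINED:
            continue
        if status != OK:
            return status          # error: abandon the phase
        if not run_all:
            return OK              # first taker wins
    return OK if run_all else DECLINED


def process(request: dict, phases: List[Tuple[List[Handler], bool]]) -> int:
    """Run the phases in order; stop the whole request on the first error."""
    for handlers, run_all in phases:
        status = run_phase(handlers, request, run_all)
        if status not in (OK, DECLINED):
            return status
    return OK


def demo_dispatch():
    """Process one normal and one forbidden request; return what was called."""
    calls = []

    def userdir(req):
        calls.append("userdir")
        if not req["uri"].startswith("/~"):
            return DECLINED
        req["filename"] = "/home/" + req["uri"][2:]
        return OK

    def default_translate(req):
        calls.append("default_translate")
        req["filename"] = "/var/www" + req["uri"]
        return OK

    def deny_private(req):
        calls.append("deny_private")
        return 403 if req["filename"].startswith("/var/www/private") else OK

    def log_a(req):
        calls.append("log_a")
        return OK

    def log_b(req):
        calls.append("log_b")
        return OK

    phases = [
        ([userdir, default_translate], False),  # translation: first taker wins
        ([deny_private], False),                # access check
        ([log_a, log_b], True),                 # logging: everyone is called
    ]
    ok_status = process({"uri": "/index.html"}, phases)
    ok_calls = list(calls)
    calls.clear()
    denied_status = process({"uri": "/private/x"}, phases)
    return ok_status, ok_calls, denied_status, list(calls)
```

In the second request the access check returns an error code, so neither logging handler is called, matching the rule that no handler from the current or any subsequent phase runs after an error.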

Figure 4. Conceptual Architecture of Apache Server

4.1. Concurrency in Apache

Some web sites are heavily loaded (many requests per minute or even per second). Traditionally, TCP/IP servers fork a new child to handle each new incoming request from a client. However, on a busy web site the overhead of forking a huge number of children would simply suffocate the machine.

As a consequence, Apache uses a different technique, namely persistent server processes. It forks a fixed number of children right from the beginning. The children service incoming requests independently (in different address spaces). Concurrency in the Apache server is pictured in Figure 5.
Alternatively, when Apache is compiled on MS Windows (as opposed to UNIX), a fixed number of threads is started from the beginning to service the incoming requests (probably due to specific characteristics of this operating system).

Figure 5. Concurrency on Apache(UNIX)
It is interesting that Apache server can dynamically control the number of children it forks (i.e. increasing or decreasing it), based on current load.
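The persistent-process technique can be sketched in a few lines of Python (UNIX only, since it relies on fork). This is an illustration of the general pre-forking idea with a fixed pool, not Apache's implementation: the function names are invented for the example, the pool size is never adjusted to the load, and the children answer with a fixed line instead of speaking HTTP.

```python
import os
import signal
import socket


def child_loop(listener: socket.socket) -> None:
    """Runs in a child: accept connections forever and answer each one."""
    while True:
        conn, _ = listener.accept()
        with conn:
            conn.recv(1024)  # read (and ignore) the request
            conn.sendall(f"served by pid {os.getpid()}\n".encode())


def start_pool(listener: socket.socket, n_children: int) -> list:
    """Fork n_children workers that all accept on the same listening socket."""
    pids = []
    for _ in range(n_children):
        pid = os.fork()
        if pid == 0:  # child process
            try:
                child_loop(listener)
            finally:
                os._exit(0)  # never fall back into the parent's code
        pids.append(pid)
    return pids


def stop_pool(pids: list) -> None:
    """Terminate the children and reap them."""
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    for pid in pids:
        os.waitpid(pid, 0)


def demo_prefork(n_children: int = 3, n_requests: int = 6):
    """Start a pool, send a few requests from the parent, then shut down."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]
    pids = start_pool(listener, n_children)
    replies = []
    try:
        for _ in range(n_requests):
            with socket.create_connection(("127.0.0.1", port)) as client:
                client.sendall(b"GET / HTTP/1.0\r\n\r\n")
                with client.makefile() as reply:
                    replies.append(reply.readline().strip())
    finally:
        stop_pool(pids)
        listener.close()
    return pids, replies
```

The children are created once, before any request arrives, and each reply names the process that produced it. Which child picks up a given connection is decided by the operating system, so the distribution of replies over the pids is not predictable.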

From another point of view, one might ask whether a module is a separate process or can be implemented as a separate process. In Apache a module is not a separate process. However some modules might fork new children in order to do their job. A ready example is the mod_cgi module, which handles CGI scripts. It must fork a new child to execute the actual CGI script (after proper redirection of the standard input and output for the child process), and wait for it to finish. But this is a characteristic of mod_cgi; many other modules do not need to fork children.

A different kind of module is one that, although it is not a separate process and does not fork children, communicates through IPC mechanisms or sockets with a different process (which might, for instance, be located on a different machine). An example of such a module would be an authorization module which communicates with a server that manages user and password information. Even the CGI module might be implemented in this way (i.e. the actual script running as a completely different process, not a child), which would result in improved security, but would have the communication overhead as a penalty.

5. Additional issues

Some additional issues have been left aside from the description of the conceptual architecture and are treated in the next sections.

5.1. Configuration of Apache Components

One of the declared purposes of the Apache server architecture is to make it highly customizable.
Configuration files make it possible to customize not only the behavior of the server but that of the modules too. Each module can advertise the custom commands it recognizes in configuration files and will be called when such commands are found. Those commands might be completely new commands (not known by the server core).
Apache even permits per-directory customization via a file called .htaccess. This file might also contain commands understood only by modules.

An interesting concept implemented by Apache is that of Virtual Hosts. The server can respond to more than one name (e.g. www.example and www2.example), each assigned to one of the multiple IP addresses of the machine. The multiple IP addresses can be addresses associated with physical network interfaces or addresses associated with virtual network interfaces (simulated via logical devices by the operating system). Apache is able to "tell" under which name the host has been referenced and to use different configuration options (e.g. allowing more access rights to users accessing the host through an interface on the local network, as opposed to users accessing the web server via an interface on the outside-the-company network). Modules also have access to this information.

To summarize, the Apache "philosophy" related to configuration is: each component takes care of its own configuration and configuration commands. The server core parses the configuration files and dispatches configuration commands to the appropriate modules to be interpreted (executed), or interprets (executes) a command itself if it was meant for the core rather than for a module.
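This dispatching of configuration commands can be sketched as follows. The code is a Python illustration, not Apache's configuration machinery: the class Module, its command method and the function load_config are invented for the example, the core is modelled as just another owner of commands, and every command simply stores its argument. The directive names used in the usage note serve only as examples.

```python
from typing import Callable, Dict, Iterable, List


class Module:
    """A component that advertises the configuration commands it understands."""

    def __init__(self, name: str):
        self.name = name
        self.config: Dict[str, str] = {}
        self.commands: Dict[str, Callable[[str], None]] = {}

    def command(self, directive: str) -> None:
        """Advertise a directive whose handler stores the given argument."""
        def store(argument: str) -> None:
            self.config[directive] = argument
        self.commands[directive] = store


def load_config(lines: Iterable[str], core: Module, modules: List[Module]) -> List[str]:
    """Parse the lines and hand each command to the component that advertised it.

    Returns the directives nobody recognized.
    """
    unknown = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        directive, _, argument = line.partition(" ")
        for owner in [core] + modules:
            if directive in owner.commands:
                owner.commands[directive](argument.strip())
                break
        else:
            unknown.append(directive)
    return unknown
```

For example, if a core object advertises ServerName and a user-directory module advertises UserDir, then load_config(["ServerName www.example", "UserDir public_html"], core, [userdir]) leaves each setting in the config dictionary of the component that owns it; the parser itself never needs to know what UserDir means.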

5.2. Compatibility with NCSA server - impact on Architecture

Starting from the code base of the NCSA server, Apache was always designed to be a drop-in replacement for this server. That means that Apache must understand and follow the configuration commands, and recognize the configuration files, of the NCSA server. However this is not an easy task because some of the commands must affect behavior that appears in more than one module. Therefore one of the main principles of the Apache configuration machinery, namely that each module takes care of its own configuration, must be broken.

To "fix" this, the problematic commands of the NCSA server (e.g. Options) are interpreted by the Apache core, even when they affect modules. The core makes this configuration available to modules in the same way it makes available the general configuration information.

5.3. Data Flow / Data Structures

Data is exchanged with various handlers in modules via a special structure called request record which includes information about the resource requested (e.g. filename), information about the configuration data related to the server, the virtual host, and the directory context in which the request is processed.

Another key structure is the one the Apache core uses to keep track of the various modules. It is a linked list of module records, each holding all the information related to that module (e.g. handlers, per-module configuration data). The module record is the means by which the core calls the module.
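The two structures can be sketched in Python as follows. The field names are hypothetical and chosen only to mirror the description above; Apache's actual C structures contain considerably more and are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class RequestRecord:
    """What a handler receives: the request plus its configuration context."""
    uri: str
    filename: Optional[str] = None
    server_config: Dict[str, str] = field(default_factory=dict)
    virtual_host: Optional[str] = None
    per_dir_config: Dict[str, str] = field(default_factory=dict)


@dataclass
class ModuleRecord:
    """One node of the core's linked list of modules."""
    name: str
    handlers: Dict[str, Callable[[RequestRecord], None]] = field(default_factory=dict)
    config: Dict[str, str] = field(default_factory=dict)
    next: Optional["ModuleRecord"] = None


def add_module(head: Optional[ModuleRecord], module: ModuleRecord) -> ModuleRecord:
    """Prepend a module record to the list and return the new head."""
    module.next = head
    return module


def iter_modules(head: Optional[ModuleRecord]) -> Iterator[ModuleRecord]:
    """Walk the linked list from the head to the end."""
    node = head
    while node is not None:
        yield node
        node = node.next


def call_phase(head: Optional[ModuleRecord], phase: str, request: RequestRecord) -> List[str]:
    """Call, through the module records, every handler offered for a phase."""
    called = []
    for module in iter_modules(head):
        handler = module.handlers.get(phase)
        if handler is not None:
            handler(request)
            called.append(module.name)
    return called
```

The core only ever holds the head of the list. Everything it knows about a module, including which function to call in which phase, is reached through that module's record, and everything a handler knows about the request arrives in the request record.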

5.4. Resource Allocation - Resource Pool

An interesting characteristic of the Apache server is the concept of the resource pool. All resources related to a request (memory, file handles) are allocated and handled through a dedicated resource pool. Furthermore, modules can define their own sub-resource pools if they want to manage private resources in the same manner as general resources.

What is characteristic for the resource pool, is that all resources are freed at once, when the resource pool is freed, preventing resource leakage. This is particularly important due to use of persistent processes.
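The idea of freeing everything at once can be sketched with a small Python class. It is an illustration of the concept only: the class Pool and its methods are invented for the example, and in-memory file objects stand in for the memory blocks and open files that a real pool would track.

```python
import io


class Pool:
    """Tracks resources so that they can all be released in one step."""

    def __init__(self, parent=None):
        self._cleanups = []   # (resource, cleanup function) pairs
        self._children = []   # sub-pools, destroyed together with this pool
        if parent is not None:
            parent._children.append(self)

    def register(self, resource, cleanup):
        """Remember a resource and how to release it; return the resource."""
        self._cleanups.append((resource, cleanup))
        return resource

    def sub_pool(self):
        """Create a pool whose lifetime is bounded by this pool."""
        return Pool(parent=self)

    def destroy(self):
        """Release the sub-pools, then this pool's own resources (newest first)."""
        for child in self._children:
            child.destroy()
        self._children.clear()
        while self._cleanups:
            resource, cleanup = self._cleanups.pop()
            cleanup(resource)


def demo_pool():
    """Open two 'files' in a request pool and a sub-pool, then free the pool."""
    request_pool = Pool()
    f1 = request_pool.register(io.StringIO("page"), lambda f: f.close())
    module_pool = request_pool.sub_pool()
    f2 = module_pool.register(io.StringIO("private"), lambda f: f.close())
    before = (f1.closed, f2.closed)
    request_pool.destroy()   # one call at the end of the request
    after = (f1.closed, f2.closed)
    return before, after
```

A handler never has to close the files it opened: destroying the request pool at the end of the request releases them, including those held in a module's private sub-pool. In a process that lives across many requests, this is what keeps a forgetful handler from leaking resources.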

6. Architectural Style

The conceptual architecture described above roughly approximates the style of "implicit invocation". It should be noted however that the architecture is not exactly an event-based architecture, as specified in [Garlan94]. It is often the case that a software architecture cannot be clearly classified in a predefined style ("Real systems hybridize and amalgamate the pure style" - [Shaw96]).
To be more specific, there is no such concept as many events that are announced (broadcast). Instead the only event is a request from an HTTP client, which starts a sequence of predictable implicit invocations.
The core has a fixed order in which it calls the different handlers, and it decides, based on configuration information, the order in which the handlers for the same phase are called.

There is, however, something that might be compared with announcing an event, namely the issuing of a sub-request by a module in order to "force" the core to perform some of the steps for a request on the sub-request (i.e. calling sequentially the handlers for each servicing phase). However this is not (conceptually) a proper event, because the issuing module does not announce something to other (unknown to it) modules. It is just a means of "forcing" an implicit invocation.

There are other characteristics of event systems (as summarized in [Shaw96]) that do not "fit" the description of the core-modules architecture of Apache. For example there is no control asynchrony, in the sense that the module issuing a sub-request waits for the sub-request to be completed.
Also, two phases of the request cannot be handled in parallel (each uses the outcome of the preceding one). Moreover, a module is not a separate process, although it can fork children for some phases, like running a CGI script.

So although the connectors between modules are implicit invocations and the data flow is a tree with some restrictions (e.g. some phases cannot have more than one module to handle them, and one phase comes after the other), the architecture does not have the other characteristics of event systems.

It can be argued however that, as different instances of Apache (sub-processes) can handle requests from different HTTP clients at the same time, there is asynchrony. However the different instances are independent and do not share information related to the requests processed.

The way a request is serviced, with phases handled one after the other and the outcome of one phase used (most of the time) by the next phase, has some similarities with the general style of "pipeline" (as in [Shaw96]). There is no upstream control (i.e. when the core invokes the handlers for one phase there is no data or control flowing upstream). However, again, there is no asynchrony and, more importantly, the core regains control after each phase (i.e. after the handler has been invoked and its job is done).

Furthermore, some phases do not produce any change in the conceptual data flow. For example, authorization and authentication do not change the request; they can only deny its execution. More significantly, some handlers might be implemented by the same module, and those handlers might exchange information via private data of the module, bypassing the main data flow. To conclude, the pipeline is rather poorly reflected by the module structures, although conceptually the idea exists; therefore implicit invocation seems more appropriate to characterize the general conceptual architectural style.

7. Extensibility of Apache

As has probably become obvious by now, the Apache server architecture easily permits changing the existing functionality or adding new functionality.
The modular approach, and the effort made by the designers to move as much as possible of the web server functionality into separate modules, make the task easier. For example, if the way URIs are translated into file names has to be extended, it is not necessary to change the module that does this task. It is sufficient to write a different module which will be called before or after the standard module has been called.

Furthermore, the ability to load modules dynamically, present in the Apache 1.3 release (no static linking with the server code), makes the task of customizing the server even easier, as there is no need to recompile the entire server. It is only necessary to change some configuration files.
Another feature worth mentioning again here is the capability of modules to define their own configuration commands, which they are implicitly called to execute.

An important part of the Apache web server that cannot be changed just by changing or adding a module is the one that implements the HTTP protocol. On the good side, the protocol is implemented as a separate piece of code (http_protocol.c), and all communication with the client is done through it, so only that part must be changed in order to implement a future version of HTTP. However it has no well defined API, as is the case for modules.

8. Conclusions

The Apache web server has a modular architecture with a core component that defines the most basic functionality of a web server (including the HTTP protocol and the reading of configuration files) and a number of modules which implement the steps of processing an HTTP request, offering handlers for one or more of the phases.

The core is the one that accepts and manages HTTP connections and calls the handlers in modules in the appropriate order to service the current request.

The architectural style can be characterized as implicit invocation made by the server core on handlers implemented by the modules. Concurrency exists only between a number of persistent identical processes that service incoming HTTP requests on the same port. Modules are not implemented as separate processes, although it is possible for them to fork children or to cooperate with other independent processes to handle a phase of processing a request.

The functionality of Apache can be easily changed by writing new modules which complement or replace the existing ones. The server is also highly configurable at different levels (virtual host, directory, module), and modules can define their own configuration commands.

9. Dictionary of terms

API
Application Programming Interface
component
term used throughout this report in order to avoid the term module, which has been reserved for Apache modules. This distinction is not standard terminology; its only purpose is to avoid confusion.
core (Apache core)
part of the Apache server that defines and manages the steps in answering the request and implements the HTTP protocol.
CGI (CGI script)
Common Gateway Interface, an interface describing how a web server passes parameters to, and receives results from, another process on the same machine called a CGI script (executed by the web server when it receives a request referencing the script).
handler
a function of a module that will be implicitly invoked by the core to handle the phase of processing the HTTP request for which the handler was designed.
HTTP
Hypertext Transfer Protocol, the protocol that coordinates how hypertext files are transferred over the Internet. However, any files can be transferred via HTTP.
httpd
the usual name for the web server (stands for HTTP daemon).
IPC (IPC mechanisms)
inter process communication mechanisms (e.g. queues, semaphores, shared memory)
MIME type
MIME stands for Multipurpose Internet Mail Extensions. MIME types are the types (e.g. gif, html) of the entities defined in the MIME Requests for Comments.
module (Apache module)
part of the Apache server that provides some functionality in one or more phases of servicing an HTTP request. Its functions (handlers) are implicitly invoked by the Apache core. It is interfaced with the Apache core by a special API.
NCSA web server (NCSA httpd)
the web server provided and maintained by the Development Group of the National Center for Supercomputing Applications, at the University of Illinois at Urbana-Champaign
request (HTTP client request)
a message from the client containing information about the resource requested and how it is wanted to be delivered.
resource (an HTTP resource)
a network data object or service which can be identified by a URI
response (HTTP server response)
the response from the web server to an HTTP request, contains a header and usually the actual resource. The header contains status information and information on the resource (e.g. type, length of the binary representation).
resource pool
A large data structure allocated in one step by the Apache core, which holds the resources (memory blocks, open files) associated with a given request. When the resource pool is no longer needed it is deallocated in one step (memory is freed and files are closed).
URI
Uniform Resource Identifiers are formatted (fixed syntax) strings which identify objects via location and other characteristics.
URL
Uniform Resource Locators, a subclass of URI that locates resources based on their location and the protocol used to fetch them (e.g. http://www.uwaterloo.ca/index.html identifies the home page file of University of Waterloo)
virtual host
a single physical host might have more than one network interface, each with a different IP address and a different host name. For clients it acts as being a number of virtual hosts, one for each name.

10. References

[Thau96]
Design considerations for the Apache Server API, Robert Thau, Fifth International World Wide Web Conference, 1996, Paris.
[APINotes]
Apache API notes , Robert S. Thau.
[ApacheDocs]
Apache server documentation
[Garlan94]
An Introduction to Software Architecture, D. Garlan, M.Shaw, Advances in Software Engineering and Knowledge Engineering, Vol. I, World Scientific Publishing Company, 1993.
[Monroe97]
Architectural Styles, Design Patterns, and Objects, R. Monroe, D. Kompanek, R. Melton, D. Garlan, IEEE Software, January 1997, pp 43-52.
[Shaw96]
A Field Guide to Boxology, M. Shaw, P. Clements, 1996


Copyright notice: all rights reserved @zhyiwww. When quoting, please cite the source: http://www.blogjava.net/zhyiwww
Posted on 2008-05-09 16:30 by zhyiwww. Category: software
