The Conceptual Architecture of the Apache Web Server
Abstract:
This report presents the conceptual (abstract)
architecture of the Apache web server. It tries to emphasize the overall
structure of the system, without going into implementation details, or
requiring such details to be previously known by the reader. The main purpose
is to make the architecture "intellectually tractable" ([Monroe97]).
The
conceptual architecture has been inferred from a number of Apache related
documents and from the way source files are grouped and named.
At a high
level the Apache server architecture is composed of a core that
implements the most basic functionality of a web server and a set of standard
modules that actually service the phases of handling an HTTP
request.
The server core accepts an HTTP request and implicitly invokes the
appropriate handlers, sequentially and in the appropriate order, to service the
request.
The report shows that the architectural style (in the sense of [Garlan94])
that most closely characterizes the Apache architecture is "implicit invocation",
although the notion of event does not exist in the architecture.
The architecture offers great opportunities for extending or changing the Apache
functionality, by means of adding or replacing modules.
Keywords:
Apache, conceptual architecture, abstract
architecture, web server
Available online at:
http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a1/apache_conceptual_arch.html
The goal of this
report is to present the conceptual (abstract) architecture of the Apache web
server. Therefore it leaves aside implementation details and tries to be simple.
As stated in [Monroe97],
a good architectural description makes the architecture "intellectually
tractable". The paper might sometimes simplify the actual architecture in order
to achieve this goal.
The report assumes no previous familiarity with the architecture of the
Apache web server. So it can serve as an introductory reading on the
architecture of the server.
It should be noted that the architecture described here might not be entirely
accurate. It has been inferred from several sources,
including the overall structure of the source files and their names. It does not start from
a previously existing complete design document.
1.1. About Apache
The Apache web server is
currently the most popular web server, according to a
NetCraft Survey. It has maintained
(and improved) its status since mid 1996. Originally, the project was based on
NCSA httpd 1.3, which is where the name comes from ("A PAtCHy Server"). Since then
the code base has been completely rewritten, and Apache has evolved into a completely
independent project.
One of the major reasons for Apache's success is the
fact that it is an "open source" project: anyone can access the Apache source
code, and anyone can write their own modules to suit their needs (source: Apache FAQ).
This is perhaps the place to mention that Apache is written to be drop-in
compatible with the NCSA server. This has design consequences related to
some configuration commands promoted by the NCSA server, which cannot be naturally
implemented in Apache. These commands are supported in a way that is somewhat
outside the general "philosophy" of the system ([Thau96]; more details in the
configuration section).
1.2. Overview
The report is organized as follows:
Section 2 offers a high level view of the conceptual architecture of Apache,
outlining the main building blocks: the Apache core and the Apache modules.
Section 3 gives details on the conceptual architecture of the Apache core and
shows the high level anatomy of a module. It also outlines the phases of
handling an HTTP request as divided by the Apache architecture. It ends with a
short description of the most representative standard modules.
Section 4 gives the conceptual architecture of the Apache server and analyzes the
concurrency in the system.
Section 5 presents some additional issues related to the architecture of the system,
mainly how configuration fits into the whole picture, how data is passed between
core and modules, and how resources are allocated and managed.
Section 6 comments on the architectural styles (in the sense of [Garlan94], [Shaw96])
applicable to the Apache architecture, while Section 7 elaborates on extensibility issues.
Conclusions and a dictionary of terms end the report.
The function of a web server is to service requests
made through the HTTP protocol. Typically the server receives a request asking for a
specific resource and returns the resource as a response. A client might
reference a file in its request, and then that file is returned; or, for example,
a directory, and then the content of that directory (encoded in some suitable
form) is returned. A client might also request a program, and it is the web
server's task to launch that program (a CGI script) and to return the output of that
program to the client. Various other resources might be referenced in a client's
request.
To summarize: the web server takes a request, decodes it, obtains the
resource and hands it to the client.
Additional concerns related to access control and client
authorization are also the responsibility of the web server. As has been
said, the web server might execute programs in response to client requests. It
must ensure that this is not a threat to the host system (where the web server
runs). In addition, the web server must be capable not only of responding to a
high rate of requests, but also of satisfying each request as quickly as possible.
2.1. Description
As opposed to a monolithic server
architecture in which all the activities are done by a single unit (in which
different parts of handling a request are poorly delimited), Apache takes a
modular approach.
Figure 1 illustrates the high level conceptual architecture. There is a
core part of the server that is responsible for defining and following
the steps in servicing a request, and several modules that actually
implement the different phases of handling the request.
As shall be seen later, Figure 1 does not capture an important characteristic of
the architecture, namely the predefined order in which modules are called, based
on their advertised characteristics.
Figure 1. High level Conceptual Architecture
The idea is to
keep the basic server code clean while allowing third-parties to override or
extend even basic characteristics.
This section presents in more detail the components of
the Apache server architecture. It presents the conceptual parts of the Apache
core and how a request is decomposed into a set of phases. It also describes the
anatomy of an Apache module (at a conceptual level).
3.1. The core
The core implements the basic
functionality of the server. In addition it implements a number of utility
functions. One utility worth mentioning is the one that provides resource
allocation from a per-request pool. This facility is offered not only to the
server core but also to modules.
The following are the components of the core:
- http_protocol.c: contains routines that communicate directly
with the client (through the socket connection), following the HTTP protocol.
All data transfers to the client are done using this component.
- http_main.c: the component that starts up the server and
contains the main server loop that waits for and accepts connections. It is
also in charge of managing timeouts.
- http_request.c: the component that handles the flow of the
request processing, dispatching control to the modules in the appropriate
order. It is also in charge of error handling.
- http_core.c: the component implementing the most basic
functionality, which is described in a comment from a source file as being
"just 'barely' functional enough to serve documents, though not terribly
well". Another interesting quote from a source file comment illustrates
very well the function of this component: "this file could almost be
mod_core.c". This means that the component behaves like a module but has to
access some globals directly (which is not characteristic of a module).
- alloc.c: the component that takes care of allocating resource pools and keeping
track of them.
- other utilities, including reading configuration files and managing the
information gathered from those files (http_config.c), as well as
support for virtual hosts. An important function of http_config
is that it forms the list of modules that will be called to service the different
phases of a request.
In the above list the term component has been used in order to avoid the term
module, which will be used only to refer to Apache modules.
Figure 2. Architecture for Apache core
Figure 2 depicts the interaction between the different components of the core. As all
components use the different utility functions, connectors to UTILITIES and
ALLOC have not been pictured.
Interaction is used in a broad sense, ranging from calling a service function of
a component to "conceptually" relinquishing control to that component.
It is interesting to observe that although the components of the core have
rather distinct functionality, there is no simple way to depict the
interactions between them. Most of the architectural information is in the
names of the components rather than in the connectors between them.
This is due to the considerable effort made by the designers to move
everything that can be expressed as a separate entity into the modules part of
the Apache server. What is left in the core are components too interconnected to
be written as separate modules.
3.2. Request Phases
A module implements only a
portion of the functionality for servicing a client request. More than one
module is necessary to completely respond to a request. However, modules do not
know about one another. Control is transferred back and forth between the
core and the different modules. This is handled by dividing the handling of the
request into a set of distinct phases.
The following are the phases of handling a request for the Apache server:
- URI to filename translation;
- Check access based on host address, and other available information;
- Get a user id from the HTTP request and validate it;
- Authorize the user;
- Determine the MIME type of the requested object (the content type, the
encoding and the language);
- Fix-ups (for example replace aliases by the actual path);
- Send the actual data back to the client;
- Log the request;
The phases are "controlled" by the http_request
component of the core, as has already been stated (see Figure 2).
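The phase ordering above can be pictured with a small sketch. This is not Apache's code: the phase names, the dictionary-based modules and the function `process_request` are all hypothetical, chosen only to show a core that walks the phases in a fixed order and calls whatever handlers the modules define for each one.

```python
from typing import Callable, Dict, List

# Hypothetical phase names, in the order listed in the text above.
PHASES: List[str] = [
    "translate_uri", "check_access", "check_user_id", "authorize",
    "determine_mime_type", "fixups", "send_response", "log_request",
]

Handler = Callable[[dict], None]


def process_request(request: dict, modules: List[Dict[str, Handler]]) -> dict:
    """Core loop: for each phase, in order, call every handler defined for it."""
    for phase in PHASES:
        for module in modules:
            handler = module.get(phase)
            if handler is not None:
                handler(request)
    return request


# Two toy "modules"; each is just a mapping from phase name to handler.
trace: List[str] = []
translate_module = {
    "translate_uri": lambda r: r.update(filename="/srv/www" + r["uri"]),
}
logging_module = {
    "log_request": lambda r: trace.append(r["filename"]),
}
```

Even if the logging module is registered before the translation module, its handler still runs last, because the core, not the modules, decides the order of the phases.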
3.3. Modules
As has been said, the role of the
modules is to implement/override/extend the functionality of the Apache web
server. All modules have the same interface to the core of the server. Modules
do not interact directly with one another. If they interact, it is always
through the Apache core (implicit invocation, as shall be seen).
Figure 3. Architecture of an Apache Module
Apache (1.3) permits
loading of modules when they are needed (they are dynamically linked with the
server) and therefore the initialization and configuration methods might be
called when the module is loaded as opposed to when the server is initialized.
3.4. Handlers
For Apache, a handler is the action
that must be performed in some phase of servicing a request. For example, when
the requested object is a file, the handler that returns the file must open
the file, read its content and hand that content to the
client (through the Apache core).
Handlers are defined by modules, and a module might specify handlers for one,
many or none of the phases of a request. Handlers are the part of the module
that is called when the processing of the request enters the phase for which the
handler is defined.
The rationale behind having modules define handlers for more than one phase
is that a module might internally save data on the request being processed, and
when its handlers for a subsequent phase of the request are called they might
make use of that data. In theory the module might even save data between
different requests (e.g. it might cache some file content for future use).
It should be noted that there are additional functions exported by modules,
related to configuration and initialization. They are called in the startup
phase of the server.
3.5. Standard Modules
Apache comes with a set of
standard modules for providing the complete functionality of a web server. The
most representative/relevant among the standard modules are listed below. They
also illustrate what kind of manipulation can be done at each phase.
- For the URI to file name translation phase:
  - mod_userdir: translates user home directories into actual paths
  - mod_rewrite (Apache 1.2 and up): rewrites URLs based on regular expressions; it
    has additional handlers for fix-ups and for determining the MIME type
- For the authentication / authorization phases:
  - mod_auth, mod_auth_anon, mod_auth_db, mod_auth_dbm: user
    authentication using text files, anonymous FTP-style access, Berkeley DB
    files, and DBM files, respectively
  - mod_access: host based access control
- For determining the MIME type of the requested object (the content type,
the encoding and the language):
  - mod_mime: determines document types using file extensions
  - mod_mime_magic: determines document types using "magic
    numbers" (e.g. all gif files start with a certain code)
- For the fix-ups phase:
  - mod_alias: replaces aliases by the actual path
  - mod_env: fixes up the environment (based on information in
    configuration files)
  - mod_speling: automatically corrects minor typos in URLs
- For sending actual data back to the client (to choose the appropriate
module for this phase, the MIME type or the pseudo MIME type, e.g. for a
CGI script, is used):
  - mod_actions: file type/method-based script execution
  - mod_asis: sends the file as it is
  - mod_autoindex: sends an automatically generated representation
    of a directory listing
  - mod_cgi: invokes CGI scripts and returns the result
  - mod_include: handles server side includes (documents parsed
    by the server, which includes certain additional data before handing the document
    to the client)
  - mod_dir: basic directory handling
  - mod_imap: handles image-map files
- For the logging phase:
  - mod_log_*: various types of logging modules
Figure 1 has shown the main components of the Apache web server and how
they interact. However, it does not illustrate the fact that handlers in modules
are called in a fixed, predefined order, which is the order of the phases of
servicing a request. Figure 4 tries to add the flow information mentioned above.
For some phases only one module (one handler in a module) can be called. Such
phases are authorization, authentication, the return of the actual
object to the client, and sometimes the URI to filename translation.
Other phases of servicing a request can have more than one handler called. For example,
more than one module can be called to implement the logging part of the
request.
In some phases of processing a request, the handlers (in the registered
modules) are called one after another until one returns a special code meaning that
subsequent registered handlers for the current phase should not be called. An
example is the URI to filename translation phase.
Furthermore, a handler might return an error code. In that case the processing of
the request stops and an error is returned to the client (i.e. no
other handlers are called, from this phase or subsequent phases).
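The three outcomes described above (a handler passes, a handler claims the phase, a handler reports an error) can be sketched as follows. The status values `OK` and `DECLINED`, the functions `run_phase` and `service`, and the flag `stop_at_first_ok` are illustrative choices made for this sketch; they are not taken from the Apache sources.

```python
# Illustrative status values; any other return value is treated as an
# HTTP error status (e.g. 403) that aborts the request.
OK = "ok"
DECLINED = "declined"


def run_phase(handlers, request, stop_at_first_ok):
    """Call the handlers of one phase and return OK, DECLINED or an error status."""
    result = DECLINED
    for handler in handlers:
        status = handler(request)
        if status == DECLINED:
            continue                # this handler passes; try the next one
        if status != OK:
            return status           # error: no further handlers are called
        result = OK
        if stop_at_first_ok:
            return OK               # phase claimed; skip the remaining handlers
    return result


def service(request, phases):
    """phases is an ordered list of (handlers, stop_at_first_ok) pairs."""
    for handlers, stop_at_first_ok in phases:
        status = run_phase(handlers, request, stop_at_first_ok)
        if status not in (OK, DECLINED):
            return status           # error: later phases are skipped as well
    return 200
```

A translation-like phase would be run with `stop_at_first_ok=True`, while a logging-like phase, where every registered handler should run, would use `False`.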
Figure 4. Conceptual Architecture of Apache Server
4.1. Concurrency in Apache
Some web sites are
heavily loaded (many requests per minute or even per second). Traditionally,
TCP/IP servers fork a new child to handle each new incoming request from a client.
However, on a busy web site the overhead of forking a huge
number of children will simply suffocate the machine.
As a consequence, Apache uses a different technique, namely persistent
server processes. It forks a fixed number of children, right from the
beginning. The children service incoming requests independently (in different
address spaces). Concurrency in the Apache server is pictured in Figure 5.
Alternatively, when Apache is compiled on MS Windows (as opposed to
UNIX), a fixed number of threads is started from the beginning to service the
incoming requests (probably due to specific characteristics of this operating
system).
Figure 5. Concurrency on Apache (UNIX)
It is
interesting that the Apache server can dynamically control the number of children it
forks (i.e. increase or decrease it), based on the current load.
From another point of view, one might ask whether a module is a
separate process or can be implemented as a separate process. In Apache a module
is not a separate process. However, some modules might fork new children in
order to do their job. A ready example is the mod_cgi module,
which handles CGI scripts. It must fork a new child to execute the actual CGI
script (after proper redirection of the standard input and output for the child
process), and wait for it to finish. But this is a characteristic of
mod_cgi; many other modules do not need to fork children.
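The fork, redirect and wait pattern described for mod_cgi can be sketched in a few lines. This is only an analogy, not how mod_cgi is written: it uses Python's `subprocess` module in place of raw fork/exec calls, and the function name `run_script` is hypothetical.

```python
import subprocess
import sys


def run_script(argv, request_body=b""):
    """Run a script as a child process with redirected stdin/stdout.

    The request body is fed to the child's standard input, the parent waits
    for the child to exit, and whatever the child wrote to its standard
    output is returned so it can be handed back to the client.
    """
    completed = subprocess.run(
        argv,
        input=request_body,        # becomes the child's stdin
        stdout=subprocess.PIPE,    # capture the child's stdout
        check=True,                # raise if the child exits with an error
    )
    return completed.stdout
```

For example, `run_script([sys.executable, "-c", "print('hi')"])` starts a second Python interpreter as the child and returns the bytes it printed.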
A different kind of module is one that, although it is not a separate
process and does not fork children, communicates through IPC mechanisms or
sockets with a different process (which might, for instance, be located on a
different machine). An example of such a module would be an authorization module
which communicates with a server that manages user and password information.
Even the CGI module might be implemented in this way (i.e. with the actual script
running as a completely different process, not as a child), which would result in
improved security, but would have the communication overhead as a penalty.
Some
additional issues have been left aside from the description of the conceptual
architecture and are treated in the next sections.
5.1. Configuration of Apache Components
One of the
declared purposes of the Apache server architecture is to make it highly
customizable.
Configuration files make it possible to customize not only the behavior
of the server but that of the modules too. Each module can advertise the
custom commands it recognizes in configuration files, and it will be called when
such commands are found. Those commands might be completely new commands (not
known by the server core).
Apache even permits per-directory customization
via a file called .htaccess. This file might also contain commands
understood only by modules.
An interesting concept implemented by Apache is that of Virtual
Hosts. The server can respond to more than one name (e.g. www.example and
www2.example), each assigned to one of the multiple IP addresses of the machine.
The multiple IP addresses can be addresses associated with physical network
interfaces or can be addresses associated with virtual network interfaces
(simulated via logical devices by the operating system). Apache is able to
"tell" under which name the host has been referenced and use different
configuration options (e.g. allowing more access rights to users accessing the
host through an interface connected to the local network, as opposed to users
accessing the web server via an interface connected to the network outside the
company). Modules also have access to this information.
To summarize, the Apache "philosophy" related to configuration is: each
component takes care of its own configuration and configuration commands. The
server core parses the configuration files and dispatches configuration commands
to the appropriate modules to be interpreted (executed), or interprets
(executes) a command itself if it was meant for the core (i.e. it is a
configuration command for the core, not for a module).
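The dispatching of configuration commands summarized above can be sketched as follows. The `Module` class, the `load_config` function and the directive `ExampleGreeting` are hypothetical, and the one-command-per-line syntax is a simplification made for this sketch rather than a description of Apache's configuration parser.

```python
class Module:
    """A toy module that advertises the configuration commands it understands."""

    def __init__(self, name, commands):
        self.name = name
        self.commands = commands    # command name -> function(list of arguments)


def load_config(lines, core_commands, modules):
    """Parse 'Command arg ...' lines and dispatch each to the core or a module."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                            # skip blank lines and comments
        name, *args = line.split()
        if name in core_commands:
            core_commands[name](args)           # the command is meant for the core
            continue
        for module in modules:
            if name in module.commands:
                module.commands[name](args)     # a module advertised this command
                break
        else:
            raise ValueError(f"unknown configuration command: {name}")
```

The core never needs to know what `ExampleGreeting` means; it only needs to know which module advertised it.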
5.2. Compatibility with NCSA server - impact on Architecture
Starting from the code base of the NCSA server, Apache
was always designed to be a drop-in replacement for this server. That means that
Apache must understand and follow the configuration commands, and recognize the
configuration files, of the NCSA server. However, this is not an easy task, because
some of the commands must affect behavior that appears in more than one module.
Therefore one of the main principles of the Apache configuration machinery,
namely that each module takes care of its own configuration,
must be broken.
To "fix" this, the problematic commands of the NCSA server (e.g. Options) are
interpreted by the Apache core, even when they affect modules. The core makes this
configuration available to modules in the same way it makes available the general
configuration information.
5.3. Data Flow / Data Structures
Data is exchanged
with the various handlers in modules via a special structure called the
request record, which includes information about the resource requested (e.g.
filename) and information about the configuration data related to the server, the
virtual host, and the directory context in which the request is processed.
Another key structure is the one the Apache core uses to keep track of the various
modules. It is a linked list of module records, each holding all the
information related to that module (e.g. handlers, per-module configuration
data). The module record is the means by which the core calls the module.
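As a rough illustration of these two structures, the sketch below models a request record and a linked list of module records. All class, field and function names are assumptions made for this example; the real structures are C structs with many more fields.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RequestRecord:
    """What is known about one request; passed to every handler."""
    uri: str
    filename: Optional[str] = None
    server_config: dict = field(default_factory=dict)
    vhost_config: dict = field(default_factory=dict)
    dir_config: dict = field(default_factory=dict)


@dataclass
class ModuleRecord:
    """Everything the core knows about one module, plus a link to the next."""
    name: str
    handlers: Dict[str, Callable[[RequestRecord], None]] = field(default_factory=dict)
    config: dict = field(default_factory=dict)
    next: Optional["ModuleRecord"] = None


def register(head: Optional[ModuleRecord], module: ModuleRecord) -> ModuleRecord:
    """Push a module record onto the front of the list and return the new head."""
    module.next = head
    return module


def call_phase(head: Optional[ModuleRecord], phase: str, request: RequestRecord) -> List[str]:
    """Walk the list and call each module's handler for the phase, if it has one."""
    called = []
    node = head
    while node is not None:
        handler = node.handlers.get(phase)
        if handler is not None:
            handler(request)
            called.append(node.name)
        node = node.next
    return called
```

Handlers communicate only by reading and updating the request record they are given, which matches the text's point that modules never call each other directly.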
5.4. Resource Allocation - Resource Pool
An
interesting characteristic of the Apache server is the concept of the
resource pool. All resources related to a request (memory, file handles) are
allocated and handled through a dedicated resource pool. Furthermore, modules
can define their own sub-resource pools if they want to manage private resources
in a manner similar to the general resources.
What is characteristic of the resource pool is that all resources are freed
at once, when the resource pool is freed, preventing resource leakage. This is
particularly important because of the use of persistent processes: a leak in a
long-lived process would accumulate from one request to the next.
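The idea of freeing everything at once can be sketched with a small pool class. The names `Pool`, `register_cleanup`, `open_file`, `sub_pool` and `destroy` are hypothetical and only model the concept; Apache's allocator is written in C and also manages raw memory blocks, which this sketch does not attempt.

```python
class Pool:
    """A toy resource pool: resources are released together when it is destroyed."""

    def __init__(self, parent=None):
        self._cleanups = []     # functions to call when the pool is destroyed
        self._children = []     # sub-pools, destroyed along with this pool
        if parent is not None:
            parent._children.append(self)

    def register_cleanup(self, fn):
        """Remember a function that releases one resource."""
        self._cleanups.append(fn)

    def open_file(self, path, mode="r"):
        """Open a file whose lifetime is tied to the pool."""
        f = open(path, mode)
        self.register_cleanup(f.close)
        return f

    def sub_pool(self):
        """Create a sub-pool, e.g. for a module's private resources."""
        return Pool(parent=self)

    def destroy(self):
        """Release every resource in one step, sub-pools first."""
        for child in self._children:
            child.destroy()
        self._children.clear()
        while self._cleanups:
            self._cleanups.pop()()   # most recently registered first
```

A handler that opens a file through the pool never has to close it explicitly, so an early error return cannot leak the file.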
The
conceptual architecture described above roughly approximates the style of
"implicit invocation". It should be noted, however, that the architecture
is not exactly an event based architecture, as specified in [Garlan94].
It is often the case that a software architecture cannot be clearly
classified in a predefined style ("Real systems hybridize and amalgamate the
pure style" - [Shaw96]).
To be more specific, there
is no concept of many events that are announced (broadcast). Instead, the
only event is a request from an HTTP client, which starts a sequence of
predictable implicit invocations.
The core calls the handlers of the different phases in a fixed order, and
decides, based on configuration information,
the order in which the handlers for the same phase are called.
There is, however, something that might be compared with announcing an event,
namely the issuing of a sub-request by a module in order to "force" the core
to perform some of the steps for a request on the sub-request (i.e. calling
sequentially the handlers for each servicing phase). However, this is not
(conceptually) a proper event, because the issuing module does not announce
something to other (unknown to it) modules. It is just a means of "forcing" an
implicit invocation.
There are other characteristics of event systems (as summarized in [Shaw96]) that do not "fit" the description of the
core-modules architecture of Apache. For example, there is no control asynchrony,
in the sense that the module issuing the sub-request waits for the sub-request
to be completed.
Also, two phases of the request cannot be handled in parallel
(one uses the outcome of the preceding one). Moreover, a module is not a
separate process, although it can fork children for some phases, like running a
CGI script.
So although the connectors between modules are implicit invocations
and data flow is a tree - with some restrictions (e.g. some phases cannot have
more than one module to handle them, and one phase comes after the other) - the
architecture does not have the other characteristics of event systems.
It can be argued, however, that since different instances of Apache
(sub-processes) can handle requests from different HTTP clients at the same time,
there is asynchrony. However, the different instances are independent and do not
share information related to the requests being processed.
The way a request is serviced, with phases handled one after the other and
the outcome of one phase used (most of the time) by the next phase, has some
similarities with the general style of "pipeline" (as in [Shaw96]). There is no upstream control (i.e. when the
core invokes the handlers for one phase there is no data or control upstream).
However, again, there is no asynchrony and, more importantly, the core regains
control after each phase (i.e. after the handler has been invoked and its
job is done).
Furthermore, some phases do not change the conceptual data flow: for example,
authorization and authentication do not change the request, they can only deny
its execution. More significantly, some handlers might be implemented by the same
module, and those handlers might exchange information via private data of the
module, bypassing the main data flow. To conclude, the pipeline is rather poorly
reflected by the module
structure, although conceptually the idea exists; therefore implicit
invocation seems more appropriate to characterize the general conceptual
architectural style.
As
has probably become obvious by now, the Apache server architecture easily permits
changing the existing functionality or adding new functionality.
The
modular approach, and the effort made by the designers to move as much
of the web server functionality as possible into separate modules, make the task
easier. For example, if the way URIs are translated into file names has to be
extended, it is not necessary to change the module that does this task. It is
sufficient to write a different module which will be called before or after the
standard module has been called.
Furthermore, the ability to load modules dynamically, present in the Apache 1.3
release (no static linking with the server code), makes the task of customizing
the server even easier, as there is no need to recompile the entire server. It is
only necessary to change some configuration files.
Another feature worth
mentioning again here is the capability of modules to define their own
configuration commands, for which they are implicitly called to execute.
An important part of the Apache web server that cannot be changed only by
changing / adding a module is the one that implements the HTTP protocol. On the
good side, the protocol is implemented as a separate piece of code
(http_protocol.c), and all communication with the client is done
through it, so only that part must be changed in order to implement a future
version of HTTP. However, it has no well defined API, as modules do.
The Apache web
server has a modular architecture, with a core component that defines the most
basic functionality of a web server (including the HTTP protocol and the reading
of configuration files) and a number of modules which implement the steps of
processing an HTTP request, offering handlers for one or more of the phases.
The core is the one that accepts and manages HTTP connections and calls the
handlers in modules in the appropriate order to service the current request.
The architectural style can be characterized as implicit invocation made
by the server core on handlers implemented by the modules. Concurrency exists
only between a number of persistent identical processes that service incoming
HTTP requests on the same port. Modules are not implemented as separate processes,
although it is possible for a module to fork children or to cooperate with other
independent processes to handle a phase of processing a request.
The functionality of Apache can be easily changed by writing new modules
which complement or replace the existing ones. The server is also highly
configurable, at different levels (virtual host, directory, module), and modules
can define their own configuration commands.
API: Application Programming Interface.
component: term used throughout this report in order to avoid the term module,
which has been used to refer to an Apache module. This
distinction is not standard terminology, and its only purpose is to avoid
confusion.
core (Apache core): part of the Apache server that defines and manages the steps in answering
the request and implements the HTTP protocol.
CGI (CGI script): Common Gateway Interface, an interface describing how a web server passes
parameters to and receives results from another process on the same machine, called a
CGI script (executed by the web server when it receives a request referencing the
script).
handler: a function of a module that will be implicitly invoked by the core to handle
the phase of processing the HTTP request for which the handler was designed.
HTTP: Hypertext Transfer Protocol, the protocol that coordinates how hypertext
files are transferred over the Internet. However, any file can be transferred via
HTTP.
httpd: the usual name for the web server (stands for HTTP daemon).
IPC (IPC mechanisms): inter-process communication mechanisms (e.g. queues, semaphores, shared
memory).
MIME type: MIME stands for Multipurpose Internet Mail Extensions. MIME types are the
types (e.g. gif, html) of the entities defined in the MIME Requests for Comments.
module (Apache module): part of the Apache server that provides some functionality in one or more phases
of servicing an HTTP request. Its functions (handlers) are implicitly invoked by
the Apache core. It is interfaced with the Apache core by a special API.
NCSA web server (NCSA httpd): the web server provided and maintained by the Development Group of the
National Center for Supercomputing Applications, at the University of Illinois
at Urbana-Champaign.
request (HTTP client request): a message from the client containing information about the resource
requested and how it should be delivered.
resource (an HTTP resource): a network data object or service which can be identified by a URI.
response (HTTP server response): the response from the web server to an HTTP request; it contains a header and
usually the actual resource. The header contains status information and
information on the resource (e.g. type, length of the binary representation).
resource pool: a large data structure allocated in one step by the Apache core, which holds
the resources (memory blocks, open files) associated with a given request. When
the resource pool is no longer needed it is deallocated in one step (memory is
freed and files are closed).
URI: Uniform Resource Identifiers are formatted (fixed syntax) strings which
identify objects via location and other characteristics.
URL: Uniform Resource Locators, a subclass of URI that identifies resources based on
their location and the protocol used to fetch them (e.g.
http://www.uwaterloo.ca/index.html identifies the home page file of the University
of Waterloo).
virtual host: a single physical host might have more than one network interface, each with
a different IP address and a different host name. To clients it appears as a
number of virtual hosts, one for each name.
[Thau96] Design considerations for the Apache Server API, Robert Thau, Fifth
International World Wide Web Conference, 1996, Paris.
[APINotes] Apache API notes, Robert S. Thau.
[ApacheDocs] Apache server documentation.
[Garlan94] An Introduction to Software Architecture, D. Garlan, M. Shaw, Advances in
Software Engineering and Knowledge Engineering, Vol. I, World Scientific
Publishing Company, 1993.
[Monroe97] Architectural Styles, Design Patterns, and Objects, R. Monroe, D. Kompanek, R. Melton,
D. Garlan, IEEE Software, January 1997, pp 43-52.
[Shaw96] A Field Guide to Boxology, M. Shaw, P. Clements, 1996.
Copyright notice: all rights reserved @zhyiwww. Please credit the source when quoting: http://www.blogjava.net/zhyiwww
posted on 2008-05-09 16:30 by zhyiwww, category: software