Abstract
The Rich Site Summary (RSS) format, previously known as the RDF Site
Summary, has quietly become the dominant format for distributing news
headlines on the Web.
In this Mother of Perl tutorial, we will write a short Perl script
(fewer than 100 lines) that retrieves an XML RSS file from the Web or
local file system and converts it to HTML. Using a Server Side Include
(SSI) or similar method, you can easily add news headlines from any
number of sources to your Web site.
History
Where did RSS come from, you ask? Netscape invented the RSS format for
"channels" on Netscape Netcenter (http://my.netscape.com). It was
released to the public in March of 1999. The first non-Netscape Web
site to incorporate the new format was Scripting News, a popular
technology news site run by Dave Winer, president of Userland Software
(think Frontier). Interestingly enough, Scripting News had been using
its own XML format, scriptingNews, since December of 1997.
In May of 1999, Dave Winer released a new version of the
scriptingNews XML format, which added new content-rich elements.
Netscape followed suit by adopting most of the new scriptingNews
elements into RSS 0.91, which was released in July of 1999.
Userland Software also rolled out their own flavor of my.netscape.com. If you haven't already guessed, it's available at http://my.userland.com.
As far as I know, RSS is the most widely used XML format on the
Web today. RSS headlines are available for many popular news sites like
Slashdot,
Forbes, and CNET News.com, and the list is growing daily.
In a time when "stickiness" is a virtue, displaying news headlines
on your Web site can really help give it the extra "oomph" that will
encourage users to return. After all, users can read your
president's bio only so many times.
Required Modules
For rss2html.pl to work on your system, you should have a recent
version of Perl installed, 5.003 or better. 5.005 is recommended. You
will also need the XML::Parser and XML::RSS modules installed.
To install the modules on a *nix system, type:
perl -MCPAN -e "install XML::Parser"
perl -MCPAN -e "install XML::RSS"
If you're using a win32 machine (Win95/98/NT), you should have a recent
installation of ActiveState Perl. If you don't have a recent version,
visit http://www.activestate.com.
To install XML::Parser on a win32 machine type:
ppm install XML-Parser
To install XML::RSS on a win32 machine (you must have a C compiler and nmake):
Next, we'll examine the RSS format in more detail.
rss2html.pl: This script converts an RSS file on the Web or local file system to HTML.
RSS 0.9
The first public version of RSS, 0.9, includes basic headline information.
Below is an example RSS file for Freshmeat.net, a popular news site for
Linux software:
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://my.netscape.com/rdf/simple/0.9/">

  <channel>
    <title>freshmeat.net</title>
    <link>http://freshmeat.net</link>
    <description>the one-stop-shop for all your Linux software needs</description>
  </channel>

  <image>
    <title>freshmeat.net</title>
    <url>http://freshmeat.net/images/fm.mini.jpg</url>
    <link>http://freshmeat.net</link>
  </image>

  <item>
    <title>Geheimnis 0.59</title>
    <link>http://freshmeat.net/news/1999/06/21/930004162.html</link>
  </item>

  <item>
    <title>Firewall Manager 1.3 PRO</title>
    <link>http://freshmeat.net/news/1999/06/21/930004148.html</link>
  </item>

  <textinput>
    <title>quick finder</title>
    <description>Use the text input below to search the fresh meat application database</description>
    <name>query</name>
    <link>http://core.freshmeat.net/search.php3</link>
  </textinput>

</rdf:RDF>
The first major element is channel, which contains the following elements:
title - the title of the channel
link - the link to the channel Web site
description - short description of the channel
An RSS channel may also contain an image element, as in the example
above, which contains the following elements:
title - the text describing the image
url - the URL of the image
link - the URL that the image is linked to
The item element holds the real channel content. Each item consists of
a title and a link element. An RSS file may contain up to 15 items.
An RSS 0.9 file may optionally contain a textinput element, which lets
users type a string into an HTML text input field and submit it via the
HTTP GET method to the URL specified in the link element.
Next, we will examine RSS 0.91 which was released by Netscape in July
of 1999.
RSS 0.91
The latest version of RSS added a few new elements. Below is a
sample RSS file from XML.com,
an excellent XML resource site:
<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
<channel>
<title>XML News and Features from XML.com</title>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<link>http://xml.com/pub</link>
<copyright>Copyright 1999, O'Reilly and Associates and Seybold Publications</copyright>
<managingEditor>dale@xml.com (Dale Dougherty)</managingEditor>
<webMaster>peter@xml.com (Peter Wiggin)</webMaster>
<image>
  <title>XML News and Features from XML.com</title>
  <url>http://xml.com/universal/images/xml_tiny.gif</url>
  <link>http://xml.com/pub</link>
  <width>88</width>
  <height>31</height>
</image>
<item> <title>Issue: XML Data Servers</title> <link>http://xml.com/pub?wwwrrr_rss</link> <description>Although not everyone agrees that XML should become a full-fledged data-management discipline, object-database vendors are busy repositioning their object-database products as XML data servers. Jon Udell looks at one of these, Object Design's eXcelon and finds it a solid product.</description> </item>
<item> <title>O'Reilly Labs Review: Object Design's eXcelon 1.1</title> <link>http://xml.com/pub/1999/08/excelon/index.html?wwwrrr_rss</link> <description>Jon Udell takes a look at eXcelon, Object Design's XML data servers, and explains its user interface and general approach to XML. </description> </item>
<item> <title>Report from Montreal</title> <link>http://xml.com/pub/1999/08/excelon/montreal.html?wwwrrr_rss</link> <description>Lisa Rein reports from MetaStructures 99 and XML Developers' Day.</description> </item>
<item> <title>Reviews: Bluestone Software's XML Suite: Promising App, Rough Around the Edges</title> <link>http://xml.com/pub/1999/08/bluestone/index.html?wwwrrr_rss</link> <description>Our reviewer tested Bluestone's XML Suite (XML Server and Visual XML) on the Windows NT platform, simulating a two-way exchange of business information between a book publisher and book stores. The results were encouraging (with a few caveats).</description> </item>
<item> <title>Interviews: CBL: Ecommerce Componentry</title> <link>http://xml.com/pub/1999/08/glushko/glushko.html?wwwrrr_rss</link> <description>In this audio interview, Bob Glushko of Commerce One talks about the Common Business Library (CBL) as a set of building blocks for XML document types and schemas used in ecommerce.</description> </item>
<item> <title>Backends Sharing Data</title> <link>http://xml.com/pub/1999/08/rpc/index.html?wwwrrr_rss</link> <description>What if you could script remote procedure calls between web sites as easily as you can between programs? Edd Dumbill shows how it can be done in PHP.</description> </item>
<item> <title>Back Issue: XML Suite</title> <link>http://xml.com/pub/1999/08/18/index.html?wwwrrr_rss</link> <description> Barry Nance runs Bluestone's XML Suite through the paces. The tools show promise for passing data between databases and XML. But there are still a few kinks to be worked out.</description> </item>
<item> <title>Back Issue: XML-RPC</title> <link>http://xml.com/pub/1999/08/11/index.html?wwwrrr_rss</link> <description>A major promise of XML is its ability to pass data simply from one place to another, regardless of platform. In this issue, Edd Dumbill shows how to use XML-RPC in PHP to pass data from a web site to a PDA.</description> </item>
<item> <title>News: InDelv XML/XSL Client Version 0.4.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-a?wwwrrr_rss</link> <description> A posting from Rob Brown reports on the public availability of the new InDelv XML Client version 0.4. This version represent an upgrade to InDelv's previously released XML Browser, but "it has been renamed as a 'Client' to reflect the fact that it now contains both an XML/XSL browser and an XML/XSL editor. The browser is available free for all uses. The editor comes packaged with the browser as a demo, which can later be upgraded to a full commercial version. This is a 100% Java appl... </description> </item>
<item> <title>News: OpenJade Development Team Releases OpenJade 1.3pre1 (Beta).</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-g?wwwrrr_rss</link> <description> A recent posting from Avi Kivity and the OpenJade Development Team announced the release of OpenJade 1.3pre1 (Beta). "OpenJade is the DSSSL user community's open source implementation of DSSSL, Document Style Semantics and Specification Language, an ISO standard for rendering SGML and XML documents. OpenJade is based on James Clark's widely used Jade. OpenJade 1.3pre1 is a more complete implementation of the DSSSL standard, and introduces many new features, including (1) Implementat... </description> </item>
<item> <title>News: IBM XML Parser Update: XML4C2 Version 2.3.1 Released.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-b?wwwrrr_rss</link> <description> Dean Roddey posted an announcement for the update of XML4C. IBM's XML for C++ parser (XML4C) "is a validating XML parser written in a portable subset of C++. XML4C makes it easy to give an application the ability to read and write XML data. Its two shared libraries provide classes for parsing, generating, manipulating, and validating XML documents. XML4C is faithful to the XML 1.0 Recommendation and associated standards (DOM 1.0, SAX 1.0). Source code, samples and API documentation ... </description> </item>
<item> <title>News: Platform for Privacy Preferences (P3P) Specification Working Draft.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-h?wwwrrr_rss</link> <description> As part of the W3C P3P Activity, a fifth public working draft of the Platform for Privacy Preferences (P3P) Specification has been published for review by W3C members. The working draft "describes the Platform for Privacy Preferences (P3P). P3P enables Web sites to express their privacy practices and enables users to exercise preferences over those practices. P3P compliant products will allow users to be informed of site practices (in both machine and human readable formats), to deleg... </description> </item>
<item> <title>News: Extended XLink with XSLT.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-c?wwwrrr_rss</link> <description> Nikita Ogievetsky (President, Cogitech, Inc.) posted an announcement for the availability of slides from the Metastructures '99 presentation "HTML Form Templates with XML. All in One and One for All. XSLT template library for WEB applications." The paper describes building XSLT template library for web applications. The goal was to "demonstrate data processing on the web made easy with XSL transformations: Generate a data maintenance web with data-structure controlled by XML, scree... </description> </item>
<item> <title>News: HyBrick Web Site Reopens.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-d?wwwrrr_rss</link> <description> A posting from Toshimitsu Suzuki (Fujitsu Laboratories Ltd.) to the XLXP-DEV mailing list recently announced the reopening of the HyBrick Web site. 'HyBrick' is "an advanced SGML/XML browser developed by Fujitsu Laboratories, the research arm of Fujitsu. HyBrick is based on an architecture that supports advanced linking and formatting capabilities. HyBrick includes a DSSSL renderer and XLink/XPointer engine running on top of James Clark's SP and Jade. HyBrick supports: (1) Both v... </description> </item>
<item> <title>News: Extended DocBook Synopses Version 1.0.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-e?wwwrrr_rss</link> <description> Norman Walsh has posted an announcement for a preliminary release of 'Extended DocBook Synopses'. Extended DocBook Synopses is a customization layer that extends DocBook, "adding a function synopsis element, ClassSynopsis for modern, mostly object-oriented, programming languages such as Java, C++, Perl, and IDL." DocBook is an SGML [and XML] DTD maintained by the DocBook Technical Committee of OASIS that particularly well suited to books and papers about computer hardware and softwar... </description> </item>
</channel>
</rss>
Notice that there are more descriptive elements for the channel, image,
and item elements. These are referred to as "fat elements" because they
carry a more detailed description of each channel item.
The XML::RSS Module
Now that you've had a chance to glance at two RSS examples, it's time to
introduce the XML::RSS module. XML::RSS is a subclass of XML::Parser,
a Perl module maintained by Clark Cooper that utilizes James Clark's
Expat C library. XML::RSS was developed to simplify the task of
manipulating and parsing RSS files. A deep understanding of XML is not
a prerequisite for using XML::RSS since the XML details are hidden
inside the class interface.
While XML::RSS is capable of creating RSS files, we will be
focusing on parsing existing RSS files in this column. You can read
more about the capabilities of XML::RSS in the module's
documentation or by typing:
perldoc XML::RSS
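To give a feel for what the class interface hides, here is a minimal sketch that parses a trimmed copy of the Freshmeat example from a string. The $rss->{channel} and $rss->{items} lookups follow the structure described in the XML::RSS documentation. The variable names and the shortened feed are assumptions made for this illustration and are not taken from rss2html.pl.

```perl
use strict;
use warnings;
use XML::RSS;

# A trimmed copy of the RSS 0.9 example shown earlier (illustrative data).
my $xml = <<'END_RSS';
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <channel>
    <title>freshmeat.net</title>
    <link>http://freshmeat.net</link>
    <description>the one-stop-shop for all your Linux software needs</description>
  </channel>
  <item>
    <title>Geheimnis 0.59</title>
    <link>http://freshmeat.net/news/1999/06/21/930004162.html</link>
  </item>
  <item>
    <title>Firewall Manager 1.3 PRO</title>
    <link>http://freshmeat.net/news/1999/06/21/930004148.html</link>
  </item>
</rdf:RDF>
END_RSS

# Create the parser object and parse the string held in $xml.
my $rss = XML::RSS->new;
$rss->parse($xml);

# Channel fields live in a hash; items are an array of hashes.
print "Channel: $rss->{channel}{title}\n";
foreach my $item (@{ $rss->{items} }) {
    print "  $item->{title} => $item->{link}\n";
}
```

The same lookups should work for an RSS 0.91 file, where each item may also carry a description.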
The Code
Well, let's look at the code, shall we?
Lines 16-17 load the XML::RSS and LWP::Simple modules. We've already
talked about XML::RSS in brief, but what does LWP::Simple do? Good
question! The answer is simple (pun intended). It's a procedural
interface for interacting with a Web server. It's also the little
cousin of LWP::UserAgent, a fuller object-oriented interface. We'll be
using one of the library's subroutines later in the code to fetch an
RSS file from the Web.
In lines 20-21 we initialize two variables that we're going to use later.
Line 25 starts the main code body. The first thing we do is verify that
the user typed exactly one command-line parameter. This parameter is
then assigned to the $arg variable in line 28.
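The argument check described above might look something like the following sketch. The check_args helper and the wording of the usage message are hypothetical; the actual script may perform the test inline on @ARGV.

```perl
use strict;
use warnings;

# Hypothetical helper: return the single argument, or die with a usage
# message when zero or several arguments were supplied.
sub check_args {
    my @args = @_;
    die "Usage: rss2html.pl (<RSS file> | <URL>)\n" unless @args == 1;
    return $args[0];
}

# In the real script this would be check_args(@ARGV); a sample list is
# used here so the sketch runs on its own.
my $arg = check_args('http://freshmeat.net/backend/fm.rdf');
print "Argument accepted: $arg\n";
```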
Next we create a new instance of the XML::RSS class and assign the
reference to the $rss variable on line 31.
Now we must determine whether the command-line parameter the user
entered is an HTTP URL or a file on the local file system
(lines 34-46). On line 34, we use a regular expression to look for the
characters "http:". If the command-line argument starts with these
characters, we can safely assume that the user intends to retrieve an
RSS file from a Web server. On line 35 we pass the argument to the
get() function, which is part of LWP::Simple, and assign the result to
the $content variable. On line 36 we call die() if $content is empty,
which means there was an error retrieving the RSS file. If the RSS file
was downloaded successfully, line 38 calls $rss->parse($content), which
parses the RSS file and stores the results in the object's internal
structure.
If the command-line argument does not start with the "http:"
characters, we assume it is a file rather than a URL (lines 41-46).
The first thing we do is assign the value of $arg to the $file variable
and test for the existence of the file (lines 42-43). Then we call
$rss->parsefile($file) (line 45), which parses the RSS file and stores
the results in the object's internal structure. The parsefile() method
parses a file, whereas the parse() method parses the string that's
passed to it.
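The two branches described above can be sketched roughly as follows. The is_http_url and load_rss helpers and the error messages are hypothetical names chosen for this illustration; only get(), parse(), and parsefile() come from LWP::Simple and XML::RSS.

```perl
use strict;
use warnings;
use XML::RSS;
use LWP::Simple qw(get);

# Hypothetical helper: true when the argument starts with "http:".
sub is_http_url {
    my ($arg) = @_;
    return $arg =~ /^http:/ ? 1 : 0;
}

# Hypothetical helper: build an XML::RSS object from a URL or a local file.
sub load_rss {
    my ($arg) = @_;
    my $rss = XML::RSS->new;

    if (is_http_url($arg)) {
        # get() returns undef when the request fails.
        my $content = get($arg);
        die "Could not retrieve $arg\n"
            unless defined $content && length $content;
        $rss->parse($content);      # parse a string
    }
    else {
        die "File \"$arg\" does not exist.\n" unless -e $arg;
        $rss->parsefile($arg);      # parse a file on disk
    }
    return $rss;
}

# Show which branch each kind of argument would take (no network access).
foreach my $arg ('http://slashdot.org/slashdot.rdf', 'slashdot.rdf') {
    printf "%s -> %s\n", $arg, is_http_url($arg) ? 'URL' : 'file';
}
```

Under this test, anything that does not begin with "http:" is treated as a file name, which matches the behavior the walkthrough describes.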
Lastly, we call the print_html subroutine on line 49, which converts
the RSS object into nicely formatted HTML.
print_html
As you examine this subroutine, you will begin to understand
the internal structure of the XML::RSS object. The critical portion
of the subroutine is contained on lines 76-79. In this foreach loop,
we iterate over each of the RSS items.
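A loop of this kind could be sketched as below. This is not the print_html subroutine from rss2html.pl: the items_to_html and escape_html names, the list markup, and the escaping step are assumptions, and a plain hash stands in for the parsed XML::RSS object so the sketch runs without any modules.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attribute values.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical stand-in for the item loop: turn each item into a linked
# list entry, skipping items that lack a title or a link.
sub items_to_html {
    my ($rss) = @_;
    my $html = "<ul>\n";
    foreach my $item (@{ $rss->{items} }) {
        next unless defined $item->{title} && defined $item->{link};
        $html .= sprintf qq{  <li><a href="%s">%s</a></li>\n},
            escape_html($item->{link}), escape_html($item->{title});
    }
    return $html . "</ul>\n";
}

# A plain hash shaped like the items portion of a parsed feed.
my $feed = {
    items => [
        {
            title => 'Geheimnis 0.59',
            link  => 'http://freshmeat.net/news/1999/06/21/930004162.html',
        },
        {
            title => 'Firewall Manager 1.3 PRO',
            link  => 'http://freshmeat.net/news/1999/06/21/930004148.html',
        },
    ],
};

print items_to_html($feed);
```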
Next, let's take a look at rss2html.pl in action.
rss2html.pl in Action
I've added the following cron jobs that run once per hour on
the Webreference server (Scheduler is the NT counterpart):
rss2html.pl http://slashdot.org/slashdot.rdf > slashdot.html
rss2html.pl http://freshmeat.net/backend/fm.rdf > freshmeat.html
rss2html.pl http://www.linuxtoday.com/backend/my-netscape.rdf > linuxtoday.html
rss2html.pl http://www.xml.com/xml/news.rdf > xmlnews.html
rss2html.pl http://www.perlxml.com/rdf/moperl.rdf > mop.html
The commands above fetch the RSS files off the Web and convert them to
HTML. Using Server-Side Includes (SSI), I've included the results below:
Conclusion
Well, we've shown in this column that Perl can really pack a wallop
in a short amount of code. With rss2html.pl, anyone can automatically
add a news feed to their Web site.
For more information on RSS, you might try visiting the following sites: