Content syndication for the Web
Mike Olson (mike.olson@fourthought.com), Principal Consultant, Fourthought, Inc.
Uche Ogbuji (uche.ogbuji@fourthought.com), Principal Consultant, Fourthought, Inc.
13 Nov 2002
RSS is one of the most successful XML services ever. Despite its chaotic
roots, it has become the community standard for exchanging content
information across Web sites. Python is an excellent tool for RSS
processing, and Mike Olson and Uche Ogbuji introduce a couple of
modules available for this purpose.
RSS is an abbreviation with several expansions: "RDF Site Summary,"
"Really Simple Syndication," "Rich Site Summary," and perhaps others.
Behind this confusion of names is an astonishing amount of politics for
such a mundane technological area. RSS is a simple XML format for
distributing summaries of content on Web sites. It can be used to share
all sorts of information including, but not limited to, news flashes,
Web site updates, event calendars, software updates, featured content
collections, and items on Web-based auctions.
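To make the format concrete, here is a minimal sketch of what an RSS document can look like and how its items can be read. The sample feed, its example.org URLs, and the use of the standard library's xml.etree.ElementTree module (in current Python 3 syntax) are illustrative assumptions; none of them come from a real feed or from the modules discussed later in this article.

```python
import xml.etree.ElementTree as ET

# A hypothetical, hand-written RSS 0.91 document (not a real feed)
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://example.org/</link>
    <description>Made-up headlines for illustration</description>
    <language>en-us</language>
    <item>
      <title>First story</title>
      <link>http://example.org/stories/1</link>
      <description>Summary of the first story</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.org/stories/2</link>
    </item>
  </channel>
</rss>
"""

def summarize(rss_text):
    """Return the channel title and a list of (title, link) pairs."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    items = [(item.findtext("title", "(none)"), item.findtext("link", "(none)"))
             for item in channel.findall("item")]
    return channel.findtext("title", "(none)"), items

channel_title, items = summarize(SAMPLE_RSS)
print("Channel:", channel_title)
for title, link in items:
    print("Item:", title, "->", link)
```

The channel carries site-level information, and each item summarizes one piece of content, which is all most consumers of a feed need.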
RSS was created by Netscape in 1999 to allow content to be gathered
from many sources into the Netcenter portal (which is now defunct). The
UserLand community of Web enthusiasts became early supporters of RSS,
and it soon became a very popular format. The popularity led to strains
over how to improve RSS to make it even more broadly useful. This
strain led to a fork in RSS development. One group chose an approach
based on RDF, in order to take advantage of the great number of RDF
tools and modules, and another chose a more stripped-down approach. The
former is called RSS 1.0, and the latter RSS 0.91. Just last month the
battle flared up again with a new version of the non-RDF variant of
RSS, which its creators are calling "RSS 2.0."
RSS 0.91 and 1.0 are very popular, and used in numerous portals and
Web logs. In fact, the blogging community is a great user of RSS, and
RSS lies behind some of the most impressive networks of XML exchange in
existence. These networks have grown organically, and are really the
most successful networks of XML services to date. RSS is an XML
service by virtue of being an exchange of XML information over an
Internet protocol (the vast majority of RSS exchange is simple HTTP GET
of RSS documents). In this article, we introduce just a few of the many
Python tools available for working with RSS. We don't provide a
technical introduction to RSS, because you can find this in so many
other articles (see Resources).
We recommend first that you gain a basic familiarity with RSS, and that
you understand XML. Understanding RDF is not required.
[We consider RSS an 'XML service' rather than a 'Web service' due to
the use of XML descriptions but the lack of use of WSDL. -- Editors]
RSS.py
Mark Nottingham's RSS.py is a Python library for RSS processing. It is
very complete and well-written. It requires Python 2.2 and PyXML 0.7.1.
Installation is easy; just download the Python file from Mark's home
page and copy it to somewhere in your PYTHONPATH.
Most users of RSS.py need only concern themselves with two classes it
provides: CollectionChannel and TrackingChannel. The latter seems the
more useful of the two. TrackingChannel is a data structure that
contains all the RSS data indexed by the key of each item.
CollectionChannel is a similar data structure, but organized more as RSS
documents themselves are, with the top-level channel information
pointing to the item details using hash values for the URLs. You will
probably use the utility namespace declarations in the RSS.ns structure.
Listing 1 is a simple script that downloads and parses an RSS feed for
Python news, and prints out all the information from the various items
in a simple listing.
from RSS import ns, CollectionChannel, TrackingChannel

#Create a tracking channel, which is a data structure that
#Indexes RSS data by item URL
tc = TrackingChannel()

#Returns the RSSParser instance used, which can usually be ignored
tc.parse("http://www.python.org/channews.rdf")

RSS10_TITLE = (ns.rss10, 'title')
RSS10_DESC = (ns.rss10, 'description')

#You can also use tc.keys()
items = tc.listItems()
for item in items:
    #Each item is a (url, order_index) tuple
    url = item[0]
    print "RSS Item:", url
    #Get all the data for the item as a Python dictionary
    item_data = tc.getItem(item)
    print "Title:", item_data.get(RSS10_TITLE, "(none)")
    print "Description:", item_data.get(RSS10_DESC, "(none)")
We start by creating a TrackingChannel instance, and then populate it
with data parsed from the RSS feed at http://www.python.org/channews.rdf.
RSS.py uses tuples as the property names for RSS data. This may seem an
unusual approach to those not used to XML processing techniques, but it
is actually a very useful way of being very precise about what was in
the original RSS file. In effect, an RSS 0.91 title element is not
considered to be equivalent to an RSS 1.0 one. There is
enough data for the application to ignore this distinction, if it
likes, by ignoring the namespace portion of each tuple; but the basic
API is wedded to the syntax of the original RSS file, so that this
information is not lost. In the code, we use this property data to
gather all the items from the news feed for display. Notice that we are
careful not to assume which properties any particular item might have.
We retrieve properties using the safe form, the dictionary get() method:
print "Title:", item_data.get(RSS10_TITLE, "(none)")
This provides a default value if the property is not found. The direct lookup below would instead raise a KeyError for a missing property:
print "Title:", item_data[RSS10_TITLE]
This precaution is necessary because you never know which elements are used in an RSS feed. Listing 2 shows the output from Listing 1.
$ python listing1.py
RSS Item: http://www.python.org/2.2.2/
Title: Python 2.2.2b1
Description: (none)
RSS Item: http://sf.net/projects/spambayes/
Title: spambayes project
Description: (none)
RSS Item: http://www.mems-exchange.org/software/scgi/
Title: scgi 0.5
Description: (none)
RSS Item: http://roundup.sourceforge.net/
Title: Roundup 0.4.4
Description: (none)
RSS Item: http://www.pygame.org/
Title: Pygame 1.5.3
Description: (none)
RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
Title: Pyrex 0.4.4.1
Description: (none)
RSS Item: http://www.tundraware.com/Software/hb/
Title: hb 1.88
Description: (none)
RSS Item: http://www.tundraware.com/Software/abck/
Title: abck 2.2
Description: (none)
RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
Title: lfm 0.9
Description: (none)
RSS Item: http://www.tundraware.com/Software/waccess/
Title: waccess 2.0
Description: (none)
RSS Item: http://www.krause-software.de/jinsitu/
Title: JinSitu 0.3
Description: (none)
RSS Item: http://www.alobbs.com/pykyra/
Title: PyKyra 0.1.0
Description: (none)
RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
Title: TreeWidgets 1.0a1
Description: (none)
RSS Item: http://civil.sf.net/
Title: Civil 0.80
Description: (none)
RSS Item: http://www.stackless.com/
Title: Stackless Python Beta
Description: (none)
Of course, you would expect somewhat different output because the
news items will have changed by the time you try it. The RSS.py channel
objects also provide methods for adding and modifying RSS information.
You can write the result back to RSS 1.0 format using the output()
method. Try this out by writing back out the information parsed in Listing 1. Kick off the script in interactive mode by running: python -i listing1.py. At the resulting Python prompt, run the following example.
>>> result = tc.output(items)
>>> print result
The result is an RSS 1.0 document printed out. You must have RSS.py
version 0.42 or more recent for this to work; there is a bug in the
output() method in earlier versions.
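The tuple-keyed item dictionaries described above can be illustrated without RSS.py itself. The following sketch uses a plain dictionary to mimic the data shape that getItem() is described as returning. The namespace URIs are the ones commonly published for RSS 1.0 and Dublin Core; the sample values and the strip_namespaces() helper are hypothetical and are not part of RSS.py. The sketch shows one way an application could ignore the namespace portion of each key, and it is written in current Python 3 syntax rather than the Python 2.2 syntax of the listings.

```python
# Namespace URIs as commonly published for RSS 1.0 and Dublin Core
RSS10_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical item data, keyed by (namespace, local name) tuples
item_data = {
    (RSS10_NS, "title"): "Python 2.2.2b1",
    (RSS10_NS, "link"): "http://www.python.org/2.2.2/",
    (DC_NS, "creator"): "A. N. Author",
}

def strip_namespaces(data):
    """Return a copy of data keyed by local name only.

    If two keys share a local name, the key that sorts first wins,
    so the result is deterministic.
    """
    flat = {}
    for (namespace, local_name) in sorted(data):
        flat.setdefault(local_name, data[(namespace, local_name)])
    return flat

flat = strip_namespaces(item_data)
print("Title:", flat.get("title", "(none)"))
print("Description:", flat.get("description", "(none)"))
```

Flattening the keys this way is convenient for display, but it discards exactly the syntactic precision that the tuple keys preserve, so it is best done at the edge of an application.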
rssparser.py
Mark Pilgrim offers another module for RSS file parsing. It doesn't
provide all the features and options that RSS.py does, but it does
offer a very liberal parser, which deals well with all the confusing
diversity in the world of RSS. To quote from the rssparser.py page:
You see, most RSS feeds suck. Invalid characters, unescaped ampersands
(Blogger feeds), invalid entities (Radio feeds), unescaped and invalid
HTML (The Register's feed most days). Or just a bastardized mix of RSS
0.9x elements with RSS 1.0 elements (Movable Type feeds).
Then there are feeds, like Aaron's feed, which are too bleeding edge. He
puts an excerpt in the description element but puts the full text in
the content:encoded element (as CDATA). This is valid RSS 1.0, but
nobody actually uses it (except Aaron), few news aggregators support
it, and many parsers choke on it. Other parsers are confused by the new
elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And
then there's Jon Udell's feed, with the fullitem element that he just
sort of made up.
It's funny to consider this in the light of the fact that XML and
Web services are supposed to increase interoperability. Anyway,
rssparser.py is designed to deal with all the madness.
Installing rssparser.py is also very easy. You download the Python
file (see Resources), rename it from "rssparser.py.txt" to
"rssparser.py", and copy it to your PYTHONPATH. I also suggest getting
the optional timeoutsocket module, which improves the timeout behavior
of socket operations in Python, and thus makes fetching RSS feeds less
likely to stall the application thread in case of error.
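As a hedged illustration of the same idea using only the standard library: Python releases later than the 2.2 version used in this article provide socket.setdefaulttimeout(), which applies a timeout to sockets created after the call. The 20-second value below is an arbitrary assumption, and this sketch does not use the timeoutsocket module's own API.

```python
import socket

# Assumed value: give up on a blocked socket operation after 20 seconds
FEED_TIMEOUT_SECONDS = 20.0

# Applies to sockets created after this call, including those that
# urllib opens internally when no explicit timeout is passed
socket.setdefaulttimeout(FEED_TIMEOUT_SECONDS)

print("Default socket timeout:", socket.getdefaulttimeout())
```

With a default timeout in place, a feed server that stops responding causes an exception instead of blocking the fetching thread indefinitely.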
Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py, rather than RSS.py.
import rssparser
#Parse the data, returns a tuple: (data for channels, data for items)
channel, items = rssparser.parse("http://www.python.org/channews.rdf")

for item in items:
    #Each item is a dictionary mapping properties to values
    print "RSS Item:", item.get('link', "(none)")
    print "Title:", item.get('title', "(none)")
    print "Description:", item.get('description', "(none)")
As you can see, the code is much simpler. The trade-off between
RSS.py and rssparser.py is largely that the former has more features,
and maintains more syntactic information from the RSS feed. The latter
is simpler, and a more forgiving parser (the RSS.py parser only accepts
well-formed XML).
The output should be the same as in Listing 2.
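The difference between a strict and a liberal parser is easy to demonstrate. The sketch below feeds a hypothetical feed fragment containing an unescaped ampersand, one of the problems named in the quote above, to the standard library's strict XML parser, which rejects it. The use of xml.etree.ElementTree and current Python 3 syntax are assumptions made for illustration; the sketch does not exercise RSS.py or rssparser.py themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragments: the first has a bare "&",
# so it is not well-formed XML
BROKEN = "<rss><channel><title>News & Notes</title></channel></rss>"
FIXED = "<rss><channel><title>News &amp; Notes</title></channel></rss>"

def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print("Broken feed parses:", is_well_formed(BROKEN))
print("Fixed feed parses:", is_well_formed(FIXED))
```

A strict parser stops at the first such error, whereas a liberal parser such as rssparser.py is designed to extract what it can from input like the first fragment.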
Conclusion
There are many Python tools for RSS, and we don't have space to cover
them all. Aaron Swartz's page of RSS tools is a good place to start
looking if you want to explore other modules out there. RSS is easy to
work with in Python, because of all the great modules available for it.
The modules hide all the chaos brought about by the history and
popularity of RSS. If your XML service needs mostly involve the
exchange of descriptive information for Web sites, we highly recommend
RSS, the most successful XML service technology in use.
Next month, we will explain how to use e-mail packages for Python for writing Web services over SMTP.
Resources
- Check out the previous installments of The Python Web services developer columns.
- There are several resources on RSS in IBM developerWorks.
- XML.com also has several articles on RSS. Read RSS: Lightweight Web Syndication, by Rael Dornfest, for a good general introduction. In Building a Semantic Web Site, Eric van der Vlist provides a great technical introduction based on very practical examples. RSS Modularization, by Leigh Dodds, follows some very interesting conversation at a crucial juncture in RSS development.
- Mark Nottingham is the author of RSS.py, and has a lot of other handy stuff on his home page, including an excellent RSS Tutorial for Content Publishers and Webmasters.
- Mark Pilgrim is the author of rssparser.py, an "ultra liberal" RSS parser. The code is available as a text download. If you install it, I also recommend getting timeoutsocket.py.
- Fredrik Lundh, the author of xmlrpclib.py and soaplib.py, is working on The EffNews Project: Building an RSS Newsreader, a Python project for creating a GUI front end for reading news from RSS feeds.
- Peerkat is a resource aggregator written in Python that allows people to use RSS to manage the Web content they follow.
- Aaron Swartz maintains a list of RSS tools for all languages and platforms.
About the authors
Mike Olson is a consultant and co-founder of Fourthought Inc.,
a software vendor and consultancy specializing in XML solutions for
enterprise knowledge management applications. Fourthought develops 4Suite, an open source
platform for XML middleware. You can contact Mr. Olson at mike.olson@fourthought.com.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc.,
a software vendor and consultancy specializing in XML solutions for
enterprise knowledge management applications. Fourthought develops 4Suite, an open source
platform for XML middleware. Mr. Ogbuji is a Computer Engineer and
writer born in Nigeria, living and working in Boulder, Colorado, USA.
You can contact Mr. Ogbuji at uche.ogbuji@fourthought.com.