Python, Java, Life, etc

A blog of technology and life.

BlogJava 首页 新随笔 联系 聚合 管理
  30 Posts :: 0 Stories :: 9 Comments :: 0 Trackbacks
Content syndication for the Web

Level: Introductory


Mike Olson (mike.olson@fourthought.com), Principal Consultant, Fourthought, Inc.
Uche Ogbuji (uche.ogbuji@fourthought.com), Principal Consultant, Fourthought, Inc.

13 Nov 2002

Column iconRSS is one of the most successful XML services ever. Despite its chaotic roots, it has become the community standard for exchanging content information across Web sites. Python is an excellent tool for RSS processing, and Mike Olson and Uche Ogbuji introduce a couple of modules available for this purpose.

RSS is an abbreviation with several expansions: "RDF Site Summary," "Really Simple Syndication," "Rich Site Summary," and perhaps others. Behind this confusion of names is an astonishing amount of politics for such a mundane technological area. RSS is a simple XML format for distributing summaries of content on Web sites. It can be used to share all sorts of information including, but not limited to, news flashes, Web site updates, event calendars, software updates, featured content collections, and items on Web-based auctions.

RSS was created by Netscape in 1999 to allow content to be gathered from many sources into the Netcenter portal (which is now defunct). The UserLand community of Web enthusiasts became early supporters of RSS, and it soon became a very popular format. The popularity led to strains over how to improve RSS to make it even more broadly useful. This strain led to a fork in RSS development. One group chose an approach based on RDF, in order to take advantage of the great number of RDF tools and modules, and another chose a more stripped-down approach. The former is called RSS 1.0, and the latter RSS 0.91. Just last month the battle flared up again with a new version of the non-RDF variant of RSS, which its creators are calling "RSS 2.0."

RSS 0.91 and 1.0 are very popular, and used in numerous portals and Web logs. In fact, the blogging community is a great user of RSS, and RSS lies behind some of the most impressive networks of XML exchange in existence. These networks have grown organically, and are really the most successful networks of XML services in existence. RSS is a XML service by virtue of being an exchange of XML information over an Internet protocol (the vast majority of RSS exchange is simple HTTP GET of RSS documents). In this article, we introduce just a few of the many Python tools available for working with RSS. We don't provide a technical introduction to RSS, because you can find this in so many other articles (see Resources). We recommend first that you gain a basic familiarity with RSS, and that you understand XML. Understanding RDF is not required.

[We consider RSS an 'XML service' rather than a 'Web service' due to the use of XML descriptions but the lack of use of WSDL. -- Editors]

RSS.py
Mark Nottingham's RSS.py is a Python library for RSS processing. It is very complete and well-written. It requires Python 2.2 and PyXML 0.7.1. Installation is easy; just download the Python file from Mark's home page and copy it to somewhere in your PYTHONPATH.

Most users of RSS.py need only concern themselves with two classes it provides: CollectionChannel and TrackingChannel. The latter seems the more useful of the two. TrackingChannel is a data structure that contains all the RSS data indexed by the key of each item. CollectionChannel is a similar data structure, but organized more as RSS documents themselves are, with the top-level channel information pointing to the item details using hash values for the URLs. You will probably use the utility namespace declarations in the RSS.ns structure. Listing 1 is a simple script that downloads and parses an RSS feed for Python news, and prints out all the information from the various items in a simple listing.



from RSS import ns, CollectionChannel, TrackingChannel

#Create a tracking channel, which is a data structure that
#Indexes RSS data by item URL
tc = TrackingChannel()

#Returns the RSSParser instance used, which can usually be ignored
tc.parse("http://www.python.org/channews.rdf")

RSS10_TITLE = (ns.rss10, 'title')
RSS10_DESC = (ns.rss10, 'description')

#You can also use tc.keys()
items = tc.listItems()
for item in items:
#Each item is a (url, order_index) tuple
url = item[0]
print "RSS Item:", url
#Get all the data for the item as a Python dictionary
item_data = tc.getItem(item)
print "Title:", item_data.get(RSS10_TITLE, "(none)")
print "Description:", item_data.get(RSS10_DESC, "(none)")



We start by creating a TrackingChannel instance, and then populate it with data parsed from the RSS feed at http://www.python.org/channews.rdf. RSS.py uses tuples as the property names for RSS data. This may seem an unusual approach to those not used to XML processing techniques, but it is actually a very useful way of being very precise about what was in the original RSS file. In effect, an RSS 0.91 title element is not considered to be equivalent to an RSS 1.0 one. There is enough data for the application to ignore this distinction, if it likes, by ignoring the namespace portion of each tuple; but the basic API is wedded to the syntax of the original RSS file, so that this information is not lost. In the code, we use this property data to gather all the items from the news feed for display. Notice that we are careful not to assume which properties any particular item might have. We retrieve properties using the safe form as seen in the code below.



print "Title:", item_data.get(RSS10_TITLE, "(none)")

Which provides a default value if the property is not found, rather than this example.



print "Title:", item_data[RSS10_TITLE]

This precaution is necessary because you never know what elements are used in an RSS feed. Listing 2shows the output from Listing 1.



$ python listing1.py
RSS Item: http://www.python.org/2.2.2/
Title: Python 2.2.2b1
Description: (none)
RSS Item: http://sf.net/projects/spambayes/
Title: spambayes project
Description: (none)
RSS Item: http://www.mems-exchange.org/software/scgi/
Title: scgi 0.5
Description: (none)
RSS Item: http://roundup.sourceforge.net/
Title: Roundup 0.4.4
Description: (none)
RSS Item: http://www.pygame.org/
Title: Pygame 1.5.3
Description: (none)
RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
Title: Pyrex 0.4.4.1
Description: (none)
RSS Item: http://www.tundraware.com/Software/hb/
Title: hb 1.88
Description: (none)
RSS Item: http://www.tundraware.com/Software/abck/
Title: abck 2.2
Description: (none)
RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
Title: lfm 0.9
Description: (none)
RSS Item: http://www.tundraware.com/Software/waccess/
Title: waccess 2.0
Description: (none)
RSS Item: http://www.krause-software.de/jinsitu/
Title: JinSitu 0.3
Description: (none)
RSS Item: http://www.alobbs.com/pykyra/
Title: PyKyra 0.1.0
Description: (none)
RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
Title: TreeWidgets 1.0a1
Description: (none)
RSS Item: http://civil.sf.net/
Title: Civil 0.80
Description: (none)
RSS Item: http://www.stackless.com/
Title: Stackless Python Beta
Description: (none)

Of course, you would expect somewhat different output because the news items will have changed by the time you try it. The RSS.py channel objects also provide methods for adding and modifying RSS information. You can write the result back to RSS 1.0 format using the output() method. Try this out by writing back out the information parsed in Listing 1. Kick off the script in interactive mode by running: python -i listing1.py . At the resuting Python prompt, run the following example.



>>> result = tc.output(items)
>>> print result

The result is an RSS 1.0 document printed out. You must have RSS.py, version 0.42 or more recent for this to work. There is a bug in the output() method in earlier versions.

rssparser.py
Mark Pilgrim offers another module for RSS file parsing. It doesn't provide all the features and options that RSS.py does, but it does offer a very liberal parser, which deals well with all the confusing diversity in the world of RSS. To quote from the rssparser.py page:

You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register's feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
Then there are feeds, like Aaron's feed, which are too bleeding edge. He puts an excerpt in the description element but puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), few news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there's Jon Udell's feed, with the fullitem element that he just sort of made up.

It's funny to consider this in the light of the fact that XML and Web services are supposed to increase interoperability. Anyway, rssparser.py is designed to deal with all the madness.

Installing rssparser.py is also very easy. You download the Python file (see Resources), rename it from "rssparser.py.txt" to "rssparser.py", and copy it to your PYTHONPATH. I also suggest getting the optional timeoutsocket module which improves the timeout behavior of socket operations in Python, and thus can help getting RSS feeds less likely to stall the application thread in case of error.

Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py, rather than RSS.py.



import rssparser
#Parse the data, returns a tuple: (data for channels, data for items)
channel, items = rssparser.parse("http://www.python.org/channews.rdf")

for item in items:
#Each item is a dictionary mapping properties to values
print "RSS Item:", item.get('link', "(none)")
print "Title:", item.get('title', "(none)")
print "Description:", item.get('description', "(none)")



As you can see, the code is much simpler. The trade-off between RSS.py and rssparser.py is largely that the former has more features, and maintains more syntactic information from the RSS feed. The latter is simpler, and a more forgiving parser (the RSS.py parser only accepts well-formed XML).

The output should be the same as in Listing 2.

Conclusion
There are many Python tools for RSS, and we don't have space to cover them all. Aaron Swartz's page of RSS tools is a good place to start looking if you want to explore other modules out there. RSS is easy to work with in Python, because of all the great modules available for it. The modules hide all the chaos brought about by the history and popularity of RSS. If your XML services needs mostly involve the exchange of descriptive information for Web sites, we highly recommend using the most successful XML service technology in employment.

Next month, we will explain how to use e-mail packages for Python for writing Web services over SMTP.

Resources

About the authors
Photo of Mike Olson Mike Olson is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. You can contact Mr. Olson at mike.olson@fourthought.com.


Photo of Uche Ogbuji Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche.ogbuji@fourthought.com.

posted on 2005-02-17 02:48 pyguru 阅读(344) 评论(0)  编辑  收藏 所属分类: Build Website

只有注册用户登录后才能发表评论。


网站导航: