
Anti-Crawler Script

Description

An ASP script which can be adapted to keep specific crawlers (robots/spiders) out of your ASP-based website, or which can be applied to a single page. The latest version of the script lets you filter sources by user-agent (either full matching or partial/wildcard matching via regular expressions) or by IP address (again, either full or partial matching is supported).

Although the script ships with a default set of rules so that it works reliably without any changes, you can quite easily remove these rules if you don't like them - creating your own powerful set of rules is remarkably easy.

Why ban a crawler / spider / application from accessing your website? Lots of reasons - sometimes it's obvious they are only using your site to harvest email addresses; other times they might be using up excessive amounts of resources or going into areas they are supposed to be excluded from. Banning them keeps them away from your data and also reduces the bandwidth they consume - all the more so because, if their first request is denied, they cannot discover and retrieve the rest of your site.

What sort of crawlers / spiders / applications are we talking about? Most of the time the elements people want to ban aren't associated with any major search engine, and even those few that do offer a search service probably bring very little (if any) traffic to the sites they crawl. The majority of the elements that get banned really only exist to further their own goals and won't help your site in the long run.

Requirements

  • IIS
  • RegExp object


Installation & Setup

  1. Save the source code file into a directory somewhere within your webroot - throughout the examples we've assumed the resulting file is called denycrawler.asp.
  2. Include the code in either a pre-existing common include file or a single page (e.g. <!-- #INCLUDE VIRTUAL="/myfolder/denycrawler.asp" -->). As the file doesn't contain any code which runs automatically, its placement above or below existing includes shouldn't be an issue.
  3. Finally, call the function DenyCrawler() from within your include or page. To work correctly it needs to be called before any headers or page content are written - this ensures that, if it needs to deny a request, it can respond with a minimal page complete with an explanation (see the sketch after these steps).
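
To make the setup concrete, a minimal sketch of a single protected page is shown below. It assumes denycrawler.asp sits in /myfolder/ as in the example above; adjust the virtual path to match your own layout.

    <!-- #INCLUDE VIRTUAL="/myfolder/denycrawler.asp" -->
    <%
    ' Run the filter before any headers or content are written, so a denied
    ' request can still receive the minimal explanation page.
    DenyCrawler()
    %>
    <html>
    <body>
      <!-- normal page content follows -->
    </body>
    </html>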

User Guide

Most of this script relies on pattern matching, specifically regular expressions which I've gathered over time based on historical traffic for this site. While the default ruleset isn't perfect, it allows most users to use the script immediately - if you feel comfortable writing regular expressions, or just want to stop one specific source of requests, feel free to erase the defaults.

If you need to test the deny function for yourself on a development system, just add an extra line into BadUA_Test - for example, UA_Add "Mozilla", sUserAgentList will match the majority of browsers, allowing you to view the deny screen in action. Don't try this on a live system, however, because you'll block all your traffic!
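
In context, that test line simply sits alongside the other rules inside BadUA_Test (sUserAgentList is the combined-pattern variable the function already uses):

    ' Development/testing only - matches almost every browser, so you can see
    ' the deny screen for yourself. Remove this line before going live.
    UA_Add "Mozilla", sUserAgentList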

Equally, if you ever want to ban an IP address, it works in much the same way - there are several examples listed inside BadIP_Test should you need to attempt this.
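
Purely as a hypothetical illustration (the variable name sIPList, the reuse of UA_Add as a plain string-builder and the addresses themselves are assumptions, not the shipped examples), an IP rule pair inside BadIP_Test might look like this:

    UA_Add "^203\.0\.113\.57$", sIPList   ' deny one specific address
    UA_Add "^203\.0\.113\.", sIPList      ' deny everything in 203.0.113.*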

Now onto the technical part.

If you ever need to write a lot of complex rules, the main thing to remember is that, to save time, the script merges all the unique regular expressions into one large expression, combining them with the regex OR operator so that a single rule is tested rather than cycling through several different ones. However, coding the data into the script in that form would make it hard to read and even harder to maintain, so instead UA_Add is used to build these strings on the fly. It takes two parameters - your regular expression string, followed by the variable used to hold the combined string.
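
The article doesn't reproduce the script's source, but a minimal sketch of how such a string-builder could work is shown below; only the two-parameter shape is taken from the description above.

    <%
    ' Sketch only: append a pattern to the combined rule string, joining the
    ' individual expressions with the regular expression OR operator "|".
    Sub UA_Add(sPattern, sList)
        If Len(sList) = 0 Then
            sList = sPattern
        Else
            sList = sList & "|" & sPattern
        End If
    End Sub
    %>

Because VBScript passes arguments by reference by default, the combined pattern accumulates in the caller's variable rather than being returned.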

The starting point for the script is DenyCrawler(), which uses EmptyUA_Test, BadUA_Test and BadIP_Test to determine whether the request should be served; the tests are checked in the order listed above.
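
Again as a sketch rather than the script's actual code, the orchestration could be as simple as the following, assuming each test returns True when the request should be refused. The tests are called explicitly one after another because VBScript's Or operator does not short-circuit.

    <%
    ' Illustrative only: check the three tests in order and stop the response
    ' with a minimal explanation page if any of them flags the request.
    Sub DenyCrawler()
        Dim bDeny
        bDeny = EmptyUA_Test()
        If Not bDeny Then bDeny = BadUA_Test()
        If Not bDeny Then bDeny = BadIP_Test()

        If bDeny Then
            Response.Status = "403 Forbidden"
            Response.Write "Access denied - this client matches the site's anti-crawler rules."
            Response.End
        End If
    End Sub
    %>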

EmptyUA_Test is just a simple piece of logic which checks whether an empty or single-character user-agent string is being used; if that comes back negative, BadUA_Test is called.
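
A check like that could look roughly as follows (a sketch under the same assumptions, not the shipped code):

    <%
    ' Sketch: treat a missing, empty or single-character user-agent as suspect.
    Function EmptyUA_Test()
        Dim sUA
        sUA = Trim(Request.ServerVariables("HTTP_USER_AGENT") & "")
        EmptyUA_Test = (Len(sUA) <= 1)
    End Function
    %>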

BadUA_Test checks whether the current user-agent string matches any of the elements in a list of regular expressions. This provides the flexibility to use exact matches, partial matches or any other type of pattern you're capable of creating.
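
Pieced together from the description, BadUA_Test plausibly builds the combined pattern with UA_Add and hands it to a single RegExp object. The two rules below are placeholders for illustration, not the shipped defaults.

    <%
    ' Sketch: combine the rules into one pattern and test the user-agent once.
    Function BadUA_Test()
        Dim sUserAgentList, oRE
        sUserAgentList = ""
        UA_Add "EmailCollector", sUserAgentList   ' placeholder rule
        UA_Add "^WebCopier", sUserAgentList       ' placeholder rule

        Set oRE = New RegExp
        oRE.Pattern = sUserAgentList
        oRE.IgnoreCase = True
        BadUA_Test = oRE.Test(Request.ServerVariables("HTTP_USER_AGENT"))
        Set oRE = Nothing
    End Function
    %>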

BadIP_Test provides the IP filtering element, working in a similar way to BadUA_Test in that it takes a series of regular expressions describing the IP addresses you want to ban. There are no default rules included in this function; it's designed to let a user filter out an element of their traffic with a high level of accuracy - something that wasn't possible with the user-agent based tests alone.
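
A matching sketch for the IP side follows. Because the shipped function contains no default rules, the commented-out examples are hypothetical, and reusing UA_Add as a generic string-builder is an assumption.

    <%
    ' Sketch: same approach as BadUA_Test, but against the client IP address.
    Function BadIP_Test()
        Dim sIPList, oRE
        sIPList = ""
        ' UA_Add "^203\.0\.113\.57$", sIPList   ' example: one address
        ' UA_Add "^198\.51\.100\.", sIPList     ' example: a whole range

        BadIP_Test = False
        If Len(sIPList) > 0 Then
            Set oRE = New RegExp
            oRE.Pattern = sIPList
            BadIP_Test = oRE.Test(Request.ServerVariables("REMOTE_ADDR"))
            Set oRE = Nothing
        End If
    End Function
    %>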

Related Links

  • Defend your e-mail address - an ASP script which uses the same type of ruleset in conjunction with other filters to make it a lot harder for e-mail harvesters to scrape your e-mail address off your webpages.

posted on 2006-12-18 21:42 by weidagang2046, filed under: Search Engine

