Fetching HTTP Content

A while ago I wrote a web crawler. It was my first time working on one, and I learned a lot. My notes on the HTTP side of it follow:

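RFC 2616 is the main specification of HTTP/1.1. The description below does not cover every aspect of it; it only presents a minimal implementation that is enough to fetch ordinary HTTP pages (it cannot fetch HTTPS pages).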

1. First, submit a URL. There are two basic modes: an ordinary GET to fetch a page, and a POST to submit data.

Create an HttpWebRequest instance, where uri is the URL of the page:
   HttpWebRequest webrequest = (HttpWebRequest) WebRequest.Create(uri);

KeepAlive makes the HTTP connection persistent:
   webrequest.KeepAlive = true;

If needed, set the Referer header. Some sites check it to block hotlinking from other sites, and login pages often validate it as well:
   if(referer!=null)
   {
    webrequest.Referer=referer;
   }

Choose the request method. GET and POST are the two main ones; HEAD is rarely used:
   switch(RequestMethod)
   {
    case 1:
     webrequest.Method="GET";
     break;
    case 2:
     webrequest.Method="POST";
     break;
    case 3:
     webrequest.Method="HEAD";
     break;
    default:
     webrequest.Method="GET";
     break;
   }
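The switch above maps a numeric code to a method name. As a minimal sketch, the same mapping can be pulled into a small helper so it can be checked on its own. The names HttpMethodSelector and ToHttpMethod are hypothetical and are not part of the original crawler:

```csharp
public static class HttpMethodSelector
{
    // Hypothetical helper: maps the numeric RequestMethod code used in the
    // switch above (1 = GET, 2 = POST, 3 = HEAD) to an HTTP method name.
    // Any other value falls back to GET, as in the original default branch.
    public static string ToHttpMethod(int requestMethod)
    {
        switch (requestMethod)
        {
            case 2:
                return "POST";
            case 3:
                return "HEAD";
            default:
                return "GET";
        }
    }
}
```

With such a helper, the assignment would reduce to webrequest.Method = HttpMethodSelector.ToHttpMethod(RequestMethod);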

Set the User-Agent. This comes up often: some sites refuse requests whose User-Agent is empty:
   webrequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322; .NET CLR 2.0.50215; fqSpider)";

Add any other HTTP headers. collHeader is a NameValueCollection:
   if(collHeader!=null&&collHeader.Count>0)
   {
    int iCount = collHeader.Count;
    string key;
    string keyvalue;

    for (int i=0; i < iCount; i++)
    {
     key = collHeader.Keys[i];
     keyvalue = collHeader[i];
     webrequest.Headers.Add(key, keyvalue);
    }
   }
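One caveat with the loop above: HttpWebRequest rejects some headers (for example Content-Length or User-Agent) when they are added through Headers.Add, because they must be set through dedicated properties. A possible guard, sketched here with the hypothetical names HeaderCopier and CopyHeaders, is to skip those names using WebHeaderCollection.IsRestricted:

```csharp
using System.Collections.Specialized;
using System.Net;

public static class HeaderCopier
{
    // Hypothetical helper: copies every non-restricted header from source
    // into target and returns how many entries were copied.
    public static int CopyHeaders(NameValueCollection source, WebHeaderCollection target)
    {
        int copied = 0;
        if (source == null)
        {
            return copied;
        }

        foreach (string key in source.AllKeys)
        {
            // Restricted headers have to be set via properties such as
            // UserAgent, Referer or ContentType, so they are skipped here.
            if (key == null || WebHeaderCollection.IsRestricted(key))
            {
                continue;
            }

            target.Add(key, source[key]);
            copied++;
        }

        return copied;
    }
}
```

In the crawler this would be called as HeaderCopier.CopyHeaders(collHeader, webrequest.Headers).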

Set the Content-Type: application/x-www-form-urlencoded for POST, and text/html for GET (a GET request carries no body, so the value has little practical effect there):
   if(webrequest.Method=="POST")
   {
    webrequest.ContentType="application/x-www-form-urlencoded";
   }
   else
   {
    webrequest.ContentType = "text/html";
   }
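The application/x-www-form-urlencoded type implies that the POST body is a list of name=value pairs joined with &, with reserved characters percent-escaped. The original post does not show how RequestData is built, so the following is only a sketch under that assumption; FormEncoder and Encode are hypothetical names. It uses Uri.EscapeDataString, which writes a space as %20 rather than +:

```csharp
using System;
using System.Collections.Generic;

public static class FormEncoder
{
    // Hypothetical helper: builds a form-urlencoded body from name/value pairs.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            // Escape both the name and the value so that characters such as
            // '&' and '=' inside them cannot break the pair structure.
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }

        return string.Join("&", parts);
    }
}
```

For example, the pairs (user, "a b") and (q, "x&y") would be encoded as user=a%20b&q=x%26y.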


Set the proxy server address and port:
   if ((ProxyServer!=null) &&(ProxyServer.Length > 0))
   {
    webrequest.Proxy = new
     WebProxy(ProxyServer,ProxyPort);
   }

Allow redirects to be followed automatically:
   webrequest.AllowAutoRedirect = true;

Set up Basic authentication for the login:
   if (NwCred)
   {
    CredentialCache wrCache =
     new CredentialCache();
    wrCache.Add(new Uri(uri),"Basic",
     new NetworkCredential(UserName,UserPwd));
    webrequest.Credentials = wrCache;
   }
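For reference, what Basic authentication puts on the wire is an Authorization header whose value is "Basic " followed by the Base64 encoding of user:password. The sketch below only illustrates that format; BasicAuth and BuildHeaderValue are hypothetical names, and the use of UTF-8 for the credentials is an assumption. With the CredentialCache approach above, the framework builds this header itself.

```csharp
using System;
using System.Text;

public static class BasicAuth
{
    // Hypothetical helper: builds the value of an Authorization header
    // for HTTP Basic authentication.
    public static string BuildHeaderValue(string userName, string password)
    {
        // Assumption: credentials are encoded as UTF-8 before Base64.
        byte[] raw = Encoding.UTF8.GetBytes(userName + ":" + password);
        return "Basic " + Convert.ToBase64String(raw);
    }
}
```

Because Base64 is an encoding and not encryption, the password is effectively sent in the clear over plain HTTP.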

Attach the cookie container to the request (Cookies is a CookieContainer):
   webrequest.CookieContainer=Cookies;

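Set the POST data. Only a POST carries a request body here; calling GetRequestStream on a GET or HEAD request throws a ProtocolViolationException, so the write is guarded:
   if (webrequest.Method == "POST")
   {
    // ASCII is sufficient when RequestData is already URL-encoded
    byte[] bytes = Encoding.ASCII.GetBytes(RequestData);
    webrequest.ContentLength = bytes.Length;

    // using closes the stream even if Write throws
    using (Stream oStreamOut = webrequest.GetRequestStream())
    {
     oStreamOut.Write(bytes, 0, bytes.Length);
    }
   }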
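One point worth keeping in mind when encoding the body with Encoding.ASCII: any character outside the ASCII range is silently replaced with '?'. The sketch below, using the hypothetical class PostBodyEncoding, contrasts raw ASCII encoding with percent-escaping the value first, which keeps the body pure ASCII without losing data:

```csharp
using System;
using System.Text;

public static class PostBodyEncoding
{
    // Encodes the body as-is; non-ASCII characters become '?'.
    public static byte[] ToAsciiBytes(string body)
    {
        return Encoding.ASCII.GetBytes(body);
    }

    // Percent-escapes the value first (Uri.EscapeDataString escapes the
    // UTF-8 bytes), so the resulting text is safe to encode as ASCII.
    public static byte[] ToEscapedAsciiBytes(string name, string value)
    {
        string body = name + "=" + Uri.EscapeDataString(value);
        return Encoding.ASCII.GetBytes(body);
    }
}
```

If RequestData can contain non-ASCII text, escaping it before the ASCII conversion is one way to avoid the loss.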

Posted on 2010-01-20 01:30 by becket_zheng. Categories: web front-end technology, C#
