
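I recently noticed that when htmlparser parses certain web pages, Traditional Chinese text comes out garbled. I looked into the cause: when you use StringBean, htmlparser decides which charset to decode with based on the page's meta tag, and some sites declare gb2312 as the charset in the meta tag while the page content actually uses GBK. GB2312 does not cover most Traditional Chinese characters, so those characters end up garbled. The fix is simple. GBK is backward compatible with GB2312, so add one check to getCharset() in htmlparser's Page.java: if ret is gb2312, set it to GBK. That solves the problem.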
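The mismatch can be reproduced without htmlparser at all. The following is a minimal sketch using only the JDK; the class name GbkVersusGb2312 and the sample string are my own illustrative choices, and it assumes a JDK that ships the GBK and GB2312 charsets (standard full JDK builds do). It encodes Traditional Chinese text as GBK, as a mislabeled server would, and then decodes the bytes once with the declared charset and once with GBK.

```java
import java.nio.charset.Charset;

class GbkVersusGb2312 {
    // Decode raw bytes with the named charset. new String(byte[], Charset)
    // substitutes U+FFFD for malformed or unmappable input instead of throwing.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Traditional Chinese" in Traditional characters, written with
        // Unicode escapes so the source file encoding does not matter.
        String traditional = "\u7e41\u9ad4\u4e2d\u6587";

        // Simulate a server that sends GBK bytes while its meta tag says gb2312.
        byte[] raw = traditional.getBytes(Charset.forName("GBK"));

        String asDeclared = decode(raw, "GB2312"); // what the meta tag asks for
        String asGbk = decode(raw, "GBK");         // what the bytes really are

        System.out.println("intact via GB2312: " + traditional.equals(asDeclared));
        System.out.println("intact via GBK:    " + traditional.equals(asGbk));
    }
}
```

On a standard JDK I would expect the GB2312 line to report false and the GBK line to report true, because the second character of the sample has no GB2312 code point.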

The modified code in Page.java (/lexer/Page.java) is as follows:


    public String getCharset (String content)
    {
        final String CHARSET_STRING = "charset";
        int index;
        String ret;

        if (null == mSource)
            ret = DEFAULT_CHARSET;
        else
            // use existing (possibly supplied) character set:
            // bug #1322686 when illegal charset specified
            ret = mSource.getEncoding ();
        if (null != content)
        {
            index = content.indexOf (CHARSET_STRING);

            if (index != -1)
            {
                content = content.substring (index +
                    CHARSET_STRING.length ()).trim ();
                if (content.startsWith ("="))
                {
                    content = content.substring (1).trim ();
                    index = content.indexOf (";");
                    if (index != -1)
                        content = content.substring (0, index);

                    //remove any double quotes from around charset string
                    if (content.startsWith ("\"") && content.endsWith ("\"")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    //remove any single quote from around charset string
                    if (content.startsWith ("'") && content.endsWith ("'")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    ret = findCharset (content, ret);

                    // Charset names are not case-sensitive;
                    // that is, case is always ignored when comparing
                    // charset names.
//                    if (!ret.equalsIgnoreCase (content))
//                    {
//                        System.out.println (
//                            "detected charset \""
//                            + content
//                            + "\", using \""
//                            + ret
//                            + "\"");
//                    }
                }
            }
        }
        if(ret.equalsIgnoreCase("gb2312"))ret="GBK"; // to avoid decoding problems (edited by linyunfan)
        return (ret);
    }

 

The only change is this line, added at the end just before the return:

        if(ret.equalsIgnoreCase("gb2312"))ret="GBK";
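If patching the library source is not an option, the same rule can be applied in your own code wherever you pick a charset for decoding. This is only a sketch: CharsetNames and widenGb2312 are hypothetical names of mine, not part of htmlparser, and the null and whitespace handling is an addition that the one-line patch does not have.

```java
class CharsetNames {
    // Hypothetical helper, not part of htmlparser: map a declared "gb2312"
    // (any letter case, surrounding whitespace ignored) to "GBK" and return
    // every other value, including null, unchanged.
    static String widenGb2312(String declared) {
        if (declared != null && "gb2312".equalsIgnoreCase(declared.trim())) {
            return "GBK";
        }
        return declared;
    }

    public static void main(String[] args) {
        System.out.println(widenGb2312("gb2312")); // the case the patch targets
        System.out.println(widenGb2312("UTF-8"));  // other charsets pass through
    }
}
```

Putting the constant first in the comparison avoids a NullPointerException when no charset was declared.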



posted on 2008-10-09 13:33 by 华梦行

Comments:
# re: Traditional Chinese turns into garbled text when htmlparser parses some pages 2008-12-28 22:01 | 繁体
...
  
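# re: Traditional Chinese turns into garbled text when htmlparser parses some pages 2009-03-11 15:45 | pnut
Nice. This works perfectly!
Calling parser.setEncoding("gbk") has no effect, because at run time the program overrides it with the "charset" of the page being fetched.
GBK is a superset of GB2312, so handling gb2312 pages with GBK causes no problems at all. In fact, the gb2312 label on many pages is not accurate, and IE always ignores it and renders with GBK anyway. But as things stand, many sites still declare gb2312, so the blogger's approach is a good idea.
One addition: Page.java is in htmllexer.jar, and the source can be downloaded from "http://sourceforge.net/projects/htmlparser/".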
  
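# re: Traditional Chinese turns into garbled text when htmlparser parses some pages [not logged in] 2011-08-07 15:53 | 小武
I changed it the way the original poster described, but it still doesn't work. Why?
QQ: 1161008015. Feel free to add me on QQ to talk it over.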
  
