Traditional Chinese turns into garbage when htmlparser parses some web pages

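I recently noticed that when htmlparser parses certain web pages, traditional Chinese text comes out garbled. After looking into it, I found that when you use StringBean, htmlparser decides by itself which encoding to decode with, based on the page's meta tag. Some sites declare gb2312 as the charset in the meta tag, but the content actually uses characters from GBK. GB2312 cannot represent traditional characters, so they come out garbled. The fix is simple: GBK is backward compatible with GB2312, so I added one check to getCharset() in htmlparser's Page.java. If ret is gb2312, it is set to GBK, and that solves the problem.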
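To see the mismatch outside of htmlparser, here is a minimal sketch. It assumes the JRE ships both the GBK and GB2312 charsets (full JDK builds normally do). The class name Gb2312VsGbkDemo and the decode helper are made up for illustration and are not part of htmlparser.

```java
import java.nio.charset.Charset;

public class Gb2312VsGbkDemo {
    // Decode raw bytes with the named charset, roughly what a parser does
    // once it has picked a charset from the meta tag.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+7E41 U+9AD4 is "fanti" (traditional) written in traditional form.
        // U+9AD4 exists in GBK but not in GB2312.
        String original = "\u7E41\u9AD4";
        byte[] raw = original.getBytes(Charset.forName("GBK"));

        // Same bytes, two declared charsets.
        System.out.println("decoded as GBK matches:    "
            + decode(raw, "GBK").equals(original));
        System.out.println("decoded as GB2312 matches: "
            + decode(raw, "GB2312").equals(original));
    }
}
```

The bytes are identical in both calls; only the declared charset differs, which is the situation on a page whose meta tag says gb2312 while the content uses GBK.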

The modified code in Page.java (/lexer/Page.java) is as follows:

    public String getCharset (String content)
    {
        final String CHARSET_STRING = "charset";
        int index;
        String ret;

        if (null == mSource)
            ret = DEFAULT_CHARSET;
        else
            // use existing (possibly supplied) character set:
            // bug #1322686 when illegal charset specified
            ret = mSource.getEncoding ();
        if (null != content)
        {
            index = content.indexOf (CHARSET_STRING);

            if (index != -1)
            {
                content = content.substring (index +
                    CHARSET_STRING.length ()).trim ();
                if (content.startsWith ("="))
                {
                    content = content.substring (1).trim ();
                    index = content.indexOf (";");
                    if (index != -1)
                        content = content.substring (0, index);

                    //remove any double quotes from around charset string
                    if (content.startsWith ("\"") && content.endsWith ("\"")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    //remove any single quote from around charset string
                    if (content.startsWith ("'") && content.endsWith ("'")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    ret = findCharset (content, ret);

                    // Charset names are not case-sensitive;
                    // that is, case is always ignored when comparing
                    // charset names.
//                    if (!ret.equalsIgnoreCase (content))
//                    {
//                        System.out.println (
//                            "detected charset \""
//                            + content
//                            + "\", using \""
//                            + ret
//                            + "\"");
//                    }
                }
            }
        }
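        // Treat a declared GB2312 as GBK to avoid decode problems: GBK is
        // a superset, so pages that declare gb2312 but contain GBK-only
        // characters still decode correctly. (edited by linyunfan)
        if (ret.equalsIgnoreCase ("gb2312"))
            ret = "GBK";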
        return (ret);
    }

The line added at the end is:

    if (ret.equalsIgnoreCase ("gb2312")) ret = "GBK";
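The same rule can be written as a small standalone helper, which makes it easy to test on its own. This is only a sketch: CharsetNormalizer and normalize are hypothetical names, not htmlparser API, and the null check and trim are my own additions on top of the one-line rule.

```java
public class CharsetNormalizer {
    // Map a declared charset name to the one to decode with.
    // Only the gb2312 -> GBK rule comes from the patch above; the null
    // check and trim() are extra assumptions for robustness.
    static String normalize(String declared) {
        if (declared != null && declared.trim().equalsIgnoreCase("gb2312"))
            return "GBK";
        return declared;
    }

    public static void main(String[] args) {
        System.out.println(normalize("gb2312")); // declared charset is widened
        System.out.println(normalize("UTF-8"));  // other charsets pass through
    }
}
```

Because charset names are not case-sensitive, the comparison ignores case, just as the patched line does.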


posted on 2008-10-09 13:33 by 华梦行
