Post archive for November 4, 2009 - 北溟有鱼


November 4, 2009

The garbled-text problem is finally solved.

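The `getContent` method below fetches an article's content and uses NekoHTML to parse it into text with the HTML tags stripped. Three places in it set the character set (they were highlighted in red in the original post). The first is the `encoding` attribute in the XML declaration (it seems to have little effect). The second is `xml.getBytes("UTF-8")`, where the string is turned into bytes and read through an input stream; if no charset is given there, the default on GAE is ISO-8859-1 (locally it follows the configured file encoding). The third is `input.setEncoding("UTF-8")`, which sets the charset for the XML parser. Last night the second one was missing, and that caused the garbled text. Testing also taught me something: GBK -> ISO-8859-1 is irreversible. ISO-8859-1 has no Chinese characters, so once Chinese text has been encoded as ISO-8859-1 it can never be converted back; the Chinese turns into "????". To be safe, always specify the charset explicitly when using input and output streams.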
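As a quick aside before the method itself, the irreversibility point can be checked with a few lines of plain Java. This is only a minimal sketch: the class name `CharsetRoundTrip` and the helper `roundTrip` are made up for illustration, `StandardCharsets` assumes Java 7 or later, and the Chinese sample text is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class CharsetRoundTrip {

    /** Encodes text to bytes with the given charset, then decodes those bytes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters

        // UTF-8 can represent the characters, so the round trip is lossless.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // prints "true"

        // ISO-8859-1 cannot represent them; getBytes substitutes '?' for each one,
        // and the original characters cannot be recovered from those bytes.
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1)); // prints "??"
    }
}
```

The loss happens at the encoding step, when the string becomes bytes, which is why passing the charset explicitly to `getBytes` matters in the method that follows.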

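    // Imports needed by this method (JDK, Xerces, and NekoHTML):
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.html.dom.HTMLDocumentImpl;
    import org.cyberneko.html.parsers.DOMFragmentParser;
    import org.w3c.dom.DocumentFragment;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    public String getContent(String xwnr) throws Exception {
        // (1) Charset declared in the XML prolog.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><content>" + xwnr + "</content>";
        DOMFragmentParser parser = new DOMFragmentParser();
        DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();

        // (2) Encode the string explicitly as UTF-8 before wrapping it in a stream.
        InputStream is = new ByteArrayInputStream(xml.getBytes("UTF-8"));

        InputSource input = new InputSource(is);
        // (3) Tell the parser which charset the byte stream uses.
        input.setEncoding("UTF-8");
        try {
            parser.parse(input, node);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException se) {
            se.printStackTrace();
        }
        StringBuffer newContent = new StringBuffer();
        // getText is defined elsewhere in this class (not shown in the post).
        this.getText(newContent, node);

        /*String str  =  ( new  String(
                newContent.toString().getBytes("Windows-1252"),  "UTF-8" ));*/

        String str = newContent.toString();

        // Return at most the first 200 characters.
        if (str.length() > 200) {
            return str.substring(0, 200);
        } else {
            return str;
        }
    }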
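The post does not show the `getText` helper that `getContent` calls. One plausible shape for it, offered only as a guess and not as the author's code, is a depth-first walk that appends every text node to the buffer. In the sketch below the class name `TextExtractor` and the `textOf` method are hypothetical, and the demo parses with the JDK's built-in XML parser instead of NekoHTML, so it only accepts well-formed markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical sketch; the real getText from the post is not shown.
class TextExtractor {

    /** Appends the value of every text node under `node` to `out`, in document order. */
    static void getText(StringBuffer out, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            getText(out, child);
        }
    }

    /** Parses well-formed XML (encoded explicitly as UTF-8) and returns its text without tags. */
    static String textOf(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            StringBuffer out = new StringBuffer();
            getText(out, doc.getDocumentElement());
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<content><p>Hello <b>world</b></p></content>")); // prints "Hello world"
    }
}
```

The helper keeps the `StringBuffer` parameter so that its signature matches the call in `getContent`; in new code a `StringBuilder` would do the same job.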

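The blog got quite a bit of attention today, which makes me very happy. Many thanks to everyone who supports me; I will gradually write up the development process and share it with you all. The garbled-text problem is finally solved.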

posted @ 2009-11-04 01:29 渔人
