Parsing HTML and using NekoHTML

MDA/MDD/TDD/DDD/DDDDDDD


Posted on 2008-02-21 18:29 by leekiang | Category: File Processing

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.html.dom.HTMLDocumentImpl;
    import org.cyberneko.html.parsers.DOMFragmentParser;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    // The class name is illustrative; only the two methods matter.
    public class HtmlTextExtractor {

        /**
         * Extracts the plain text from an HTML string.
         *
         * @param content the HTML markup to parse
         * @return the text of all text nodes, concatenated in document order
         */
        public String extractTextFromHTML(String content) {
            DOMFragmentParser parser = new DOMFragmentParser();
            DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();
            try {
                // Parse from a character stream: the String is already decoded,
                // so the parser does not have to guess a byte encoding.
                parser.parse(new InputSource(new StringReader(content)), node);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (SAXException se) {
                se.printStackTrace();
            }

            // StringBuffer (not StringBuilder) keeps the code JDK 1.4 compatible.
            StringBuffer newContent = new StringBuffer();
            getText(newContent, node);
            return newContent.toString();
        }

        /** Recursively appends the value of every text node under the given node. */
        private void getText(StringBuffer sb, Node node) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                sb.append(node.getNodeValue());
            }
            NodeList children = node.getChildNodes();
            if (children != null) {
                int len = children.getLength();
                for (int i = 0; i < len; i++) {
                    getText(sb, children.item(i));
                }
            }
        }
    }
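
The recursive text walk above relies only on the org.w3c.dom interfaces, so it can be tried without NekoHTML. The following is a minimal sketch that swaps in the XML parser bundled with the JDK, assuming the input is well-formed XHTML (the JDK parser rejects the tag soup that NekoHTML is meant to handle). The names DomTextDemo, extractText and collectText are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomTextDemo {

    // Parses well-formed markup with the JDK's own parser and returns its text.
    static String extractText(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // A character stream avoids any byte-level charset guessing.
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            StringBuffer sb = new StringBuffer();
            collectText(sb, doc);
            return sb.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Depth-first walk: appends every text node in document order.
    static void collectText(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    public static void main(String[] args) {
        // \u4E16\u754C is a two-character Chinese word, written as escapes so the
        // source file compiles regardless of its own encoding.
        System.out.println(extractText("<p>Hello, <b>\u4E16\u754C</b>!</p>"));
    }
}
```

Because the markup is handed over as characters rather than bytes, non-ASCII text comes back unchanged without any charset conversion afterwards.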

1. NekoHTML 1.9.6.1 uses JDK 5 methods such as Arrays.hashCode, so NekoHTML 1.9.6 is used here to stay compatible with JDK 1.4.
2. xerces.jar is required on the classpath.
3. Related links:
   http://hi.baidu.com/walkandsing/blog/item/f5743634c6ba2e3a5bb5f5e5.html
   http://blog.csdn.net/zhou2002/archive/2008/01/19/2053911.aspx
   http://playfish.javaeye.com/blog/150184

4. Parsing HTML in other languages:
   Python:
   http://lenciel.cn/docs/python-parser-of-xml/
   http://hi.baidu.com/javalang/blog/item/84bac4bf731fb80f18d81fe1.html
   Ruby: use hpricot.
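
Note 1 hinges on methods such as Arrays.hashCode that only exist from JDK 5 onward. For readers curious what is missing on JDK 1.4, here is a rough sketch of an equivalent for Object arrays, following the formula documented for java.util.Arrays.hashCode(Object[]). The class name Jdk14Arrays is hypothetical and this is not code from NekoHTML; staying on NekoHTML 1.9.6 remains the simpler route.

```java
import java.util.Arrays;

class Jdk14Arrays {

    // Same contract as java.util.Arrays.hashCode(Object[]) in JDK 5:
    // 0 for a null array, otherwise 31 * h + elementHash, starting from 1.
    static int hashCode(Object[] a) {
        if (a == null) {
            return 0;
        }
        int result = 1;
        // Plain indexed loop: the for-each syntax is not available on JDK 1.4.
        for (int i = 0; i < a.length; i++) {
            Object element = a[i];
            result = 31 * result + (element == null ? 0 : element.hashCode());
        }
        return result;
    }

    public static void main(String[] args) {
        Object[] sample = new Object[] { "html", null, "neko" };
        // On JDK 5 or later the two results can be compared directly.
        System.out.println(hashCode(sample) == Arrays.hashCode(sample));
    }
}
```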
