Several ways to extract text from Word and PDF files

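    1. Using jacob.
    jacob is a bridge: middleware that connects Java to COM or Win32 functions. jacob cannot extract Word, Excel, or similar files on its own; it needs a native DLL. You do not have to write that DLL yourself, because the author of jacob ships one with the library.
   Download jacob:
http://www.matrix.org.cn/down_view.asp?id=13
    After downloading jacob and putting the files in place (the DLL on the PATH, the jar on the classpath), you can write your own extraction program. Here is an example: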

import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class FileExtracter
{
    public static void main(String[] args)
    {
        // Start Word through COM; Word must be installed on this machine.
        ActiveXComponent app = new ActiveXComponent("Word.Application");
        String inFile = "c:\\test.doc";
        String tpFile = "c:\\temp.htm";
        boolean flag = false;
        try
        {
            // Keep the Word window hidden.
            app.setProperty("Visible", new Variant(false));
            Dispatch docs = app.getProperty("Documents").toDispatch();
            // Open(FileName, ConfirmConversions = false, ReadOnly = true)
            Dispatch doc = Dispatch
                    .invoke(docs, "Open", Dispatch.Method, new Object[]
                    {inFile, new Variant(false), new Variant(true)}, new int[1])
                    .toDispatch();
            // SaveAs with format 8 (wdFormatHTML) writes the document as HTML.
            Dispatch.invoke(doc, "SaveAs", Dispatch.Method, new Object[]
            {tpFile, new Variant(8)}, new int[1]);
            // Close without saving changes to the source document.
            Dispatch.call(doc, "Close", new Variant(false));
            flag = true;
        } catch (Exception e)
        {
            e.printStackTrace();
        } finally
        {
            // Always shut Word down, even if the conversion failed.
            app.invoke("Quit", new Variant[] {});
        }
        System.out.println("conversion succeeded: " + flag);
    }
}
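
The example above leaves its result as HTML in temp.htm (format code 8), so one more step is needed to get plain text. The following is a rough sketch of that step using only the JDK. The class name HtmlTextSketch and the method htmlToText are hypothetical, and the regular expressions are a simplification: a real HTML parser would handle Word's markup more robustly.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jacob: crude HTML-to-text conversion.
class HtmlTextSketch {
    // <script>...</script> and <style>...</style> blocks, including their bodies.
    private static final Pattern BLOCKS =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // HTML comments.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag.
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]+>");

    static String htmlToText(String html) {
        String text = BLOCKS.matcher(html).replaceAll(" ");
        text = COMMENTS.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last so that
        // "&amp;lt;" is not decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace into single spaces.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello&nbsp;<b>Word</b></p><!-- note -->"
                + "<p>a &lt; b</p></body></html>";
        System.out.println(htmlToText(html));
    }
}
```

In practice the contents of c:\temp.htm would first be read into a string, using whatever charset the file declares in its meta tag, and then passed to htmlToText.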

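    2. Using Apache POI to extract Word and Excel.
    POI is an Apache project. Even so, using POI directly can feel tedious. That is not a problem, because a simpler wrapper interface is available:
    Download the wrapped POI package:
http://www.matrix.org.cn/down_view.asp?id=14
    After downloading it, put it on your classpath. Here is an example of how to use it: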

import java.io.FileInputStream;

import org.textmining.text.extraction.WordExtractor;

/**
 * Title: word extraction
 * Description: email:chris@matrix.org.cn
 * Company: Matrix.org.cn
 *
 * @author chris
 * @version 1.0, anyone who uses this example, please retain this notice
 */
// Despite its name, this class extracts text from a Word (.doc) file.
public class PdfExtractor
{
    public static void main(String args[]) throws Exception
    {
        FileInputStream in = new FileInputStream("c:\\a.doc");
        try
        {
            WordExtractor extractor = new WordExtractor();
            // extractText reads the whole document and returns its text.
            String str = extractor.extractText(in);
            System.out.println("the result length is " + str.length());
            System.out.println("the result is " + str);
        }
        finally
        {
            // Release the file handle even if extraction fails.
            in.close();
        }
    }
}
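
Text pulled out of a .doc file can still carry the control characters that the Word binary format uses internally: a carriage return for paragraph ends, 0x07 for table cell marks, and 0x13 to 0x15 for field markers. Whether the wrapped extractor above already strips them is not stated here, so the following is only a sketch under that assumption. The class name ExtractedTextCleaner and the method clean are hypothetical.

```java
// Hypothetical post-processing step for text extracted from a .doc file.
class ExtractedTextCleaner {
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r') {
                // Paragraph mark; treat a CR followed by LF as one line break.
                if (i + 1 < raw.length() && raw.charAt(i + 1) == '\n') {
                    i++;
                }
                sb.append('\n');
            } else if (c == '\u000B' || c == '\f') {
                // Manual line break or page break becomes a plain line break.
                sb.append('\n');
            } else if (c == '\u0007') {
                // Table cell mark becomes a tab so that cells stay separated.
                sb.append('\t');
            } else if (c == '\n' || c == '\t' || !Character.isISOControl(c)) {
                // Keep ordinary characters; drop any other control character.
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("Heading\rCell A\u0007Cell B\u0007"));
    }
}
```

The string returned by extractor.extractText(in) could be passed through clean before it is printed or indexed.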

3. pdfbox, for extracting PDF files
pdfbox's support for Chinese is still poor. First download pdfbox:

http://www.matrix.org.cn/down_view.asp?id=12
Here is an example of how to extract a PDF file with pdfbox:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.OutputStreamWriter;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Company: Matrix.org.cn
 *
 * @author chris
 * @version 1.0, anyone who uses this example, please retain this notice
 */
public class PdfExtracter
{
    public String GetTextFromPdf(String filename) throws Exception
    {
        PDDocument pdfdocument = null;
        FileInputStream is = new FileInputStream(filename);
        try
        {
            // Parse the PDF into an in-memory document.
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfdocument = parser.getPDDocument();
            // Write the stripped text into a byte buffer (platform default charset).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfdocument.getDocument(), writer);
            writer.close();
            byte[] contents = out.toByteArray();
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        }
        finally
        {
            // Release the document and the file handle.
            if (pdfdocument != null)
            {
                pdfdocument.close();
            }
            is.close();
        }
    }

    public static void main(String args[])
    {
        PdfExtracter pf = new PdfExtracter();
        try
        {
            String ts = pf.GetTextFromPdf("c:\\a.pdf");
            System.out.println(ts);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
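
The pdfbox example collects the text through an OutputStreamWriter and a byte array, both using the platform default charset. Since the text stripper only needs a Writer, a StringWriter can capture the characters directly and skip the byte round trip. The sketch below illustrates the difference with JDK classes only. TextSource is a hypothetical stand-in for the call to stripper.writeText, and the class and method names are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriterCaptureSketch {
    // Hypothetical stand-in for anything that writes text to a Writer,
    // such as the stripper.writeText(document, writer) call.
    interface TextSource {
        void writeText(Writer out) throws IOException;
    }

    // Capture characters directly; no charset is involved at all.
    static String captureDirect(TextSource source) {
        try {
            StringWriter out = new StringWriter();
            source.writeText(out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // If bytes are required, name the charset on both sides of the round trip.
    static String captureViaBytes(TextSource source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
            source.writeText(writer);
            writer.close();
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the word for "Chinese", written with escapes so the
        // source file compiles the same way under any source encoding.
        TextSource fake = out -> out.write("PDF \u4e2d\u6587");
        System.out.println(captureDirect(fake).equals(captureViaBytes(fake)));
    }
}
```

Neither variant fixes text that the PDF library itself fails to decode; it only avoids losing characters after extraction.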

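     4. Extracting PDF files with Chinese support: xpdf
   xpdf is an open-source project; we can call its native program to extract text from Chinese PDF files.
Download the xpdf package:
http://www.matrix.org.cn/down_view.asp?id=15
You also need to download the Chinese support patch:
http://www.matrix.org.cn/down_view.asp?id=16
   Install the Chinese patch as described in its readme, and then you can write a Java program that calls the native program.
Here is an example of how to call it: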

import java.io.BufferedInputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;

/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Company: Matrix.org.cn
 *
 * @author chris
 * @version 1.0, anyone who uses this example, please retain this notice
 */
public class PdfWin
{
    public static void main(String args[]) throws Exception
    {
        String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
        String filename = "c:\\a.pdf";
        // -enc UTF-8: output encoding; -q: suppress messages; "-": write text to stdout
        String[] cmd = new String[]
        {PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
        InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
        StringWriter out = new StringWriter();
        char[] buf = new char[10000];
        int len;
        while ((len = reader.read(buf)) >= 0)
        {
            // Accumulate every chunk; buf alone only holds the most recent read.
            out.write(buf, 0, len);
            System.out.println("the length is " + len);
        }
        reader.close();
        // Build the result from everything that was read, not from buf.
        String ts = out.toString();
        System.out.println("the str is " + ts);
    }
}
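
Calling pdftotext from Java comes down to two steps: building the command line, then draining the process's standard output chunk by chunk and accumulating all of it. The sketch below separates those steps so each can be checked without xpdf installed. The class name PdfToTextSketch and its method names are hypothetical; the flags are the ones used in the example above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PdfToTextSketch {
    // Same arguments as the example: UTF-8 output, quiet mode, text to stdout.
    static String[] buildCommand(String pdftotextPath, String pdfFile) {
        return new String[] {pdftotextPath, "-enc", "UTF-8", "-q", pdfFile, "-"};
    }

    // Read the stream to the end, appending every chunk that arrives.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int len;
            while ((len = reader.read(buf)) >= 0) {
                text.append(buf, 0, len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Runs the external tool; assumes pdftotext exists at the given path.
    static String extract(String pdftotextPath, String pdfFile)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdftotextPath, pdfFile))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr from filling up
                .start();
        String text = readAll(p.getInputStream(), StandardCharsets.UTF_8);
        p.waitFor();
        return text;
    }

    public static void main(String[] args) {
        // Demonstrates readAll on an in-memory stream, so no PDF tool is needed.
        byte[] fake = "first chunk, second chunk".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(fake), StandardCharsets.UTF_8));
    }
}
```

With xpdf installed, a call such as extract("C:\\Program Files\\xpdf\\pdftotext.exe", "c:\\a.pdf") would return the text of the whole document.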

posted on 2006-11-27 10:26 MyBox
