Converting HTML Content to PDF (2)

Posted on 2006-06-01 07:56 by shaofan | Views: 3534 | Comments: 10 | Category: Java

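The Java program

Using the DOM APIs of the three tools from the steps above, I now present a Java program. It takes two command-line arguments (the HTML file and the stylesheet) and generates the corresponding PDF document without creating any temporary files.

The program first creates an InputStream for the HTML file and passes it to JTidy.

JTidy's parseDOM() method returns the Document object for the XHTML output.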

public static void main(String[] args) {
    // Open the file
    if (args.length != 2) {
        System.out.println("Usage: Html2Pdf htmlFile styleSheet");
        System.exit(1);
    }
    FileInputStream input = null;
    String htmlFileName = args[0];
    try {
        input = new FileInputStream(htmlFileName);
    }
    catch (java.io.FileNotFoundException e) {
        System.out.println("File not found: " + htmlFileName);
        System.exit(1); // stop here rather than pass a null stream to JTidy
    }
    // JTidy parses the HTML and returns the cleaned-up document as a DOM tree
    Tidy tidy = new Tidy();
    Document xmlDoc = tidy.parseDOM(input, null);

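JTidy's DOM implementation does not support XML namespaces. We therefore have to modify the Antenna House stylesheet so that it uses the default namespace. For example, the original template: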

<xsl:template match="html:h2">
  <fo:block xsl:use-attribute-sets="h2">
    <xsl:call-template name="process-common-attributes-and-children"/>
  </fo:block>
</xsl:template>

becomes, after the change:

<xsl:template match="h2">
  <fo:block xsl:use-attribute-sets="h2">
    <xsl:call-template name="process-common-attributes-and-children"/>
  </fo:block>
</xsl:template>

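This change must be applied to every template in xhtml2fo.xsl, because the Document object that JTidy generates has a plain <html> tag as its root.

The modified xhtml2fo.xsl is included in the source code that accompanies this article.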
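To see why the prefix matters, here is a small self-contained sketch that uses only the JDK's built-in XSLT processor, not JTidy or the Antenna House stylesheet. It builds a namespace-free DOM by hand as a stand-in for JTidy's output and applies a one-template stylesheet with a configurable match pattern. The class name, method name, and the `[h2 matched]` marker are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NamespaceMatchDemo {

    // Applies a stylesheet whose single element template uses the given match
    // pattern, and returns the text output. No element in the input DOM
    // belongs to a namespace.
    static String apply(String matchPattern) {
        String xslt =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
          + " xmlns:html='http://www.w3.org/1999/xhtml'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='" + matchPattern + "'>[h2 matched]</xsl:template>"
          + "<xsl:template match='text()'/>" // suppress plain text output
          + "</xsl:stylesheet>";
        try {
            // Hand-built stand-in for JTidy's output: <html><h2>Title</h2></html>
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(null, "html");
            Element h2 = doc.createElementNS(null, "h2");
            h2.setTextContent("Title");
            html.appendChild(h2);
            doc.appendChild(html);

            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("html:h2 -> '" + apply("html:h2") + "'");
        System.out.println("h2      -> '" + apply("h2") + "'");
    }
}
```

The prefixed pattern only matches elements in the XHTML namespace, so it produces nothing for this input, while the unprefixed pattern matches the namespace-free h2 element.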

Next, the xml2FO() method calls Xalan to apply the stylesheet to the DOM object that JTidy produced:

Document foDoc = xml2FO(xmlDoc, args[1]);

xml2FO() first calls getTransformer() to obtain a Transformer object for the given stylesheet. It then returns the Document that represents the result of the transformation:

private static Document xml2FO(Document xml, String styleSheet) {
    // Both the input and the output of the transformation are DOM trees
    DOMSource xmlDomSource = new DOMSource(xml);
    DOMResult domResult = new DOMResult();
    Transformer transformer = getTransformer(styleSheet);
    if (transformer == null) {
        System.out.println("Error creating transformer for " + styleSheet);
        System.exit(1);
    }
    try {
        transformer.transform(xmlDomSource, domResult);
    }
    catch (javax.xml.transform.TransformerException e) {
        // Report the failure; the caller receives null
        System.out.println("Error applying " + styleSheet + ": " + e.getMessage());
        return null;
    }
    return (Document) domResult.getNode();
}
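The article does not list getTransformer(). A minimal sketch of what such a helper might look like, assuming the standard JAXP TransformerFactory API and a stylesheet stored as a file, is shown here. The class name TransformerHelper is hypothetical, and the version in the accompanying source code may differ.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

class TransformerHelper {

    // Returns a Transformer for the given stylesheet file, or null if the
    // stylesheet cannot be loaded or compiled. Returning null matches the
    // null check that xml2FO() performs on the result.
    static Transformer getTransformer(String styleSheet) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            return factory.newTransformer(new StreamSource(new File(styleSheet)));
        } catch (TransformerConfigurationException e) {
            System.err.println("Cannot load stylesheet " + styleSheet + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        Transformer t = getTransformer(args.length > 0 ? args[0] : "xhtml2fo.xsl");
        System.out.println(t == null ? "no transformer" : "transformer ready");
    }
}
```

TransformerFactory.newInstance() picks up whichever XSLT processor is configured, so Xalan is used when it is on the classpath and registered as the JAXP implementation.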

Next, the main method opens a FileOutputStream whose name has the same prefix as the HTML input file. The result of calling fo2PDF() is then written to that OutputStream:

// Cut at the last dot so that dots earlier in the path are not mistaken for the extension
int dot = htmlFileName.lastIndexOf('.');
String pdfFileName = (dot < 0 ? htmlFileName : htmlFileName.substring(0, dot)) + ".pdf";
try {
    OutputStream pdf = new FileOutputStream(new File(pdfFileName));
    pdf.write(fo2PDF(foDoc));
    pdf.close(); // flush and release the file
}
catch (java.io.FileNotFoundException e) {
    System.out.println("Error creating PDF: " + pdfFileName);
}
catch (java.io.IOException e) {
    System.out.println("Error writing PDF: " + pdfFileName);
}
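Deriving the output name from the input name has a few edge cases, such as directory names that contain dots and file names without an extension. The following is a hedged sketch of a helper that handles them. The class PdfNames and the method pdfNameFor are hypothetical and not part of the article's code.

```java
class PdfNames {

    // Replaces the extension of the last path segment with ".pdf".
    // If the file name has no extension, ".pdf" is simply appended.
    static String pdfNameFor(String htmlFileName) {
        int lastSep = Math.max(htmlFileName.lastIndexOf('/'),
                               htmlFileName.lastIndexOf('\\'));
        int dot = htmlFileName.lastIndexOf('.');
        // A dot counts as an extension separator only if it lies inside the
        // file name itself and is not the first character of that name.
        String base = (dot > lastSep + 1)
                ? htmlFileName.substring(0, dot)
                : htmlFileName;
        return base + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(pdfNameFor("page.html"));
        System.out.println(pdfNameFor("./docs/page.html"));
        System.out.println(pdfNameFor("README"));
    }
}
```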

fo2PDF() uses the XSL-FO Document produced by the transformation and a ByteArrayOutputStream to create a FOP Driver object. Calling Driver.run generates the PDF, which is returned as a byte array:

private static byte[] fo2PDF(Document foDocument) {
    DocumentInputSource fopInputSource = new DocumentInputSource(
            foDocument);
    try {
        // Render into memory rather than into a file
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Logger log = new ConsoleLogger(ConsoleLogger.LEVEL_WARN);
        Driver driver = new Driver(fopInputSource, out);
        driver.setLogger(log);
        driver.setRenderer(Driver.RENDER_PDF);
        driver.run();
        return out.toByteArray();
    } catch (Exception ex) {
        // Report the failure; the caller receives null
        System.out.println("Error rendering PDF: " + ex.getMessage());
        return null;
    }
}

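The source for Html2Pdf.java is included in the code that accompanies this article.

Running the whole process through the DOM APIs is much faster than using the command-line interfaces, because no intermediate files are written to disk. The approach can also be integrated into a server to handle concurrent HTML-to-PDF conversion requests.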
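The idea of chaining stages through DOM objects without touching the disk can be sketched with the JDK alone. In this simplified example, which is an assumption-laden stand-in rather than the article's program, the input is parsed from a string instead of by JTidy, a tiny inline stylesheet (SAMPLE_XSLT, hypothetical) replaces xhtml2fo.xsl, and the final stage serializes the XSL-FO tree to bytes instead of rendering a PDF with FOP.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InMemoryPipeline {

    // Tiny stand-in for xhtml2fo.xsl: maps <html> to fo:root and <h2> to fo:block.
    static final String SAMPLE_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='html'><fo:root><xsl:apply-templates/></fo:root></xsl:template>"
      + "<xsl:template match='h2'><fo:block><xsl:apply-templates/></fo:block></xsl:template>"
      + "</xsl:stylesheet>";

    static byte[] convert(String xhtml) {
        try {
            // Stage 1: parse the input into a DOM tree (JTidy's role in the article).
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document source = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Stage 2: transform DOM to DOM; the intermediate result stays in memory.
            TransformerFactory factory = TransformerFactory.newInstance();
            DOMResult fo = new DOMResult();
            factory.newTransformer(new StreamSource(new StringReader(SAMPLE_XSLT)))
                   .transform(new DOMSource(source), fo);

            // Stage 3: write the final bytes to a buffer (FOP's role in the article;
            // here an identity transform only serializes the tree).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            factory.newTransformer()
                   .transform(new DOMSource(fo.getNode()), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] result = convert("<html><h2>Title</h2></html>");
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

Because every stage hands its result to the next as an object, each call to convert() is independent of the others, which is what makes this style of pipeline suitable for use inside a server.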

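I previously used the program shown here as the basis for adding PDF generation to a web application. Because the PDFs are generated dynamically and never stored on the server, there is no need to keep web pages and their PDF versions in sync.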

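Conclusion

In this article I have described how to convert HTML to PDF using open source components. Although this approach is attractive in terms of price and access to source code, it also involves trade-offs. Some commercial components offer more complete implementations of the standards.

For example, FOP is currently at version 0.91 and does not fully support the XSL-FO standard. Even so, its support for PDF is more extensive than its support for other output formats.

Before starting a document-conversion project, consider your document-format requirements and compare them against what the available components implement. Doing so will help you make the right choice.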

Resources

# Download the source code for this article:

http://www.javaworld.com/javaworld/jw-04-2006/html/jw-0410-html.zip

# Adobe's Document Server product:

http://www.adobe.com/products/server/documentserver/main.html

# Antenna House (sells a commercial formatter):

http://www.antennahouse.com

# xhtml2fo.xsl, the stylesheet that converts XHTML to XSL-FO:

http://www.antennahouse.com/XSLsample/XSLsample.htm

# Apache FOP formatter, which converts XSL-FO to PDF:

http://xmlgraphics.apache.org/fop

# FOP's compliance with the XSL-FO standard:

http://xmlgraphics.apache.org/fop/compliance.html

# JTidy, which converts HTML to XHTML:

http://sourceforge.net/projects/jtidy/

# Xalan:

http://xalan.apache.org/

# XSL-FO, Dave Pawson (O'Reilly Media, August 2002; ISBN: 0596003552):

http://www.amazon.com/exec/obidos/ASIN/0596003552/javaworld

# XSLT 2.0 Programmer's Reference, Michael Kay (Wrox, August 2004; ISBN: 0764569090):

http://www.amazon.com/exec/obidos/ASIN/0764569090/javaworld

# Browse JavaWorld's Development Tools section for more articles on Java development tools:

http://www.javaworld.com/channel_content/jw-tools-index.shtml

# Index of articles in JavaWorld's Java and XML section:

http://www.javaworld.com/channel_content/jw-xml-index.shtml

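# re: Converting HTML Content to PDF (2)

2006-06-02 11:28 by 六世软件

Nice, I'll give it a try.

# re: Converting HTML Content to PDF (2)

2006-06-02 12:41 by 六世软件

It doesn't support Chinese.

# re: Converting HTML Content to PDF (2)

2006-06-02 17:47 by shaofan

@六世软件
Oh, I really didn't test it with Chinese. That is a problem.

# re: Converting HTML Content to PDF (2)

2006-06-05 16:44 by 六世软件

How can the Chinese problem be solved?

# re: Converting HTML Content to PDF (2)

2006-06-08 06:10 by shaofan

@六世软件
I'm not sure about that. I'll look into it when I have time.

# re: Converting HTML Content to PDF (2)

2007-04-02 09:09 by howesen

I have solved the Chinese problem by rewriting some methods in the iText package. A modified version is now available for download on my home page; see the description there for exactly what was changed.
Home page: http://down.latea.cn

# re: Converting HTML Content to PDF (2)

2007-04-02 20:54 by shaofan2

@howesen
Could you give a direct link?

# re: Converting HTML Content to PDF (2)

2007-07-02 11:12 by 振

@howesen
You only posted a site address, and I can't find anything there. If you want to help people, please don't send us in circles.

# re: Converting HTML Content to PDF (2)

2007-11-15 11:46 by Gary

In Document foDoc = xml2FO(xmlDoc, args[1]), which argument from the article is args[1]?

# re: Converting HTML Content to PDF (2)

2008-07-22 03:30 by Hi

How do I keep the blank lines in the HTML?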


Shao Fan