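How to read or parse MHTML (.mht) files in Java

I need to mine the content of most known document file types, such as:

  1. PDF
  2. HTML
  3. DOC/DOCX, etc.

For most of these file formats I plan to use:

http://tika.apache.org/

As of now, however, Tika does not support MHTML (*.mht) files (http://en.wikipedia.org/wiki/MHTML). There are a few examples in C# (http://www.codeproject.com/KB/files/MhtBuilder.aspx), but I could not find any in Java.

I tried opening a *.mht file in 7-Zip, but it failed, whereas WinZip was able to extract the file into images and text (CSS, HTML, script) as text and binary files.

According to the MSDN page (http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content) and the CodeProject page mentioned above, mht files use GZip compression.

Trying to decompress the file in Java results in the following exceptions. With java.util.zip.GZIPInputStream: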

    java.io.IOException: Not in GZIP format
        at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

    java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(Unknown Source)
        at java.util.zip.ZipFile.<init>(Unknown Source)
        at GZipTest.main(GZipTest.java:21)
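
One quick way to see why both attempts fail is to look at the first bytes of the file: a GZIP stream always begins with the magic bytes 0x1f 0x8b, whereas a browser-saved .mht typically begins with plain-text headers. The following is only a minimal sketch; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class GzipSniffer {
    /** Returns true if the given bytes begin with the GZIP magic number 0x1f 0x8b. */
    static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: GzipSniffer <file>");
            return;
        }
        byte[] head = new byte[2];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // A single read is enough for this sketch; it may return fewer than 2 bytes.
            int n = in.read(head);
            head = Arrays.copyOf(head, Math.max(n, 0));
        }
        System.out.println(looksLikeGzip(head) ? "GZIP data" : "not GZIP data");
    }
}
```

If the check reports that the file is not GZIP data, opening it in a text editor is usually the fastest way to see what it really contains.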

Please suggest how to decompress it.

Thanks.

Frankly, I was not expecting to find a solution any time soon and was about to give up, but then I stumbled upon these pages:

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

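At first glance they do not look very appealing, but if you read them carefully you will get the clue. After reading them, I fired up IE and started saving random pages as *.mht files. Let me go through one line by line.

First, though, let me explain that my ultimate goal was to separate/extract the HTML content and parse it. The solution is not complete in itself, because it depends on the character set/encoding I chose while saving. Even so, it extracts the individual files with only slight hitches.

I hope this will be useful for anyone trying to parse/decompress *.mht/MHTML files :)

======= Explanation ======== **Taken from an mht file**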

 From: "Saved by Windows Internet Explorer 7" 

This is the software that was used to save the file.

    Subject: Google
    Date: Tue, 13 Jul 2010 21:23:03 +0530
    MIME-Version: 1.0

Subject, date, and MIME version: much like the format of a mail message.

  Content-Type: multipart/related; type="text/html"; 

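This is the part that tells us it is a multipart document. A multipart document has one or more different sets of data combined in a single body, and a multipart Content-Type field must appear in the entity's header. Here we can also see that the type is "text/html".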

 boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0" 

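This is the most important part of all. It is the unique delimiter that separates the different parts (HTML, images, CSS, scripts, etc.). Once you get hold of it, everything becomes easy. Now I just have to iterate through the document, find the different sections, and save them according to their Content-Transfer-Encoding (base64, quoted-printable, etc.).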
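
The idea of locating the boundary and cutting the document into parts can be sketched as follows. This is an illustrative sketch rather than the parser shown later: the class MhtSplitter and its methods are hypothetical names, and it assumes the whole file fits in memory as a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MhtSplitter {
    // Matches boundary="..." (quoted) or boundary=token (unquoted), ignoring case.
    private static final Pattern BOUNDARY = Pattern.compile(
            "boundary\\s*=\\s*(?:\"([^\"]+)\"|([^\\s;]+))", Pattern.CASE_INSENSITIVE);

    /** Returns the first boundary value declared in the document, or null if none is found. */
    static String findBoundary(String mht) {
        Matcher m = BOUNDARY.matcher(mht);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    /** Splits the document into raw parts (part headers plus body), skipping the top-level headers. */
    static List<String> splitParts(String mht, String boundary) {
        // Inside the body, each delimiter line is the boundary prefixed with two dashes.
        String delimiter = "--" + boundary;
        String[] chunks = mht.split(Pattern.quote(delimiter));
        List<String> parts = new ArrayList<>();
        for (int i = 1; i < chunks.length; i++) { // chunk 0 holds the top-level headers
            if (chunks[i].startsWith("--")) {
                break; // "--boundary--" marks the end of the document
            }
            parts.add(chunks[i].trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        String mht = "Content-Type: multipart/related;\r\n\tboundary=\"XYZ\"\r\n\r\n"
                + "--XYZ\r\nContent-Type: text/html\r\n\r\n<html></html>\r\n--XYZ--\r\n";
        String boundary = findBoundary(mht);
        System.out.println(splitParts(mht, boundary).size() + " part(s) found");
    }
}
```

Each returned part still contains its own headers, a blank line, and then the encoded body, so the next step is to read those headers and decode the body accordingly.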

Sample

    ------=_NextPart_000_0007_01CB22D1.93BBD1A0
    Content-Type: text/html;
    	charset="utf-8"
    Content-Transfer-Encoding: quoted-printable
    Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8
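
Reading the headers of a single part like the sample above could look like the sketch below. It assumes that header values may be folded onto indented continuation lines (as the charset parameter often is) and that a blank line ends the header block. PartHeaders and its methods are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class PartHeaders {
    /** Parses the header block of one part into a map keyed by lower-cased header name. */
    static Map<String, String> parse(String part) {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null;
        for (String line : part.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line separates the headers from the body
            }
            if ((line.startsWith(" ") || line.startsWith("\t")) && current != null) {
                // Folded continuation line: append it to the previous header's value.
                headers.put(current, headers.get(current) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a header line; skip it
            }
            current = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(current, line.substring(colon + 1).trim());
        }
        return headers;
    }

    /** Returns a parameter such as charset from a header value, or the fallback if absent. */
    static String parameter(String headerValue, String name, String fallback) {
        if (headerValue == null) {
            return fallback;
        }
        for (String piece : headerValue.split(";")) {
            String[] keyValue = piece.trim().split("=", 2);
            if (keyValue.length == 2 && keyValue[0].trim().equalsIgnoreCase(name)) {
                return keyValue[1].trim().replace("\"", "");
            }
        }
        return fallback;
    }
}
```

With the sample above, parse would yield content-type, content-transfer-encoding, and content-location entries, and parameter(headers.get("content-type"), "charset", "utf-8") would give the character set to use when writing the part out.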

**Java code**

An interface to define the constants.

    public interface IConstants {
        public String BOUNDARY = "boundary";
        public String CHAR_SET = "charset";
        public String CONTENT_TYPE = "Content-Type";
        public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
        public String CONTENT_LOCATION = "Content-Location";
        public String UTF8_BOM = "=EF=BB=BF";
        public String UTF16_BOM1 = "=FF=FE";
        public String UTF16_BOM2 = "=FE=FF";
    }

The main parser class:

    /**
     * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
     * which accompanies this distribution, and is available at
     * http://www.eclipse.org/legal/epl-v10.html
     */
    package com.test.mht.core;

    import java.io.BufferedOutputStream;
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.OutputStreamWriter;
    import java.util.Base64;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Parses a *.mht file and decomposes it into its constituent parts.
     * @author Manish Shukla
     */
    public class MHTParser implements IConstants {
        private File mhtFile;
        private File outputFolder;

        public MHTParser(File mhtFile, File outputFolder) {
            this.mhtFile = mhtFile;
            this.outputFolder = outputFolder;
        }

        /**
         * Walks through the file line by line and writes each part to the output folder.
         * Assumes the layout produced by IE: one header per line, with the charset
         * parameter on its own continuation line.
         * @throws Exception
         */
        public void decompress() throws Exception {
            String type = "";
            String encoding = "";
            String location = "";
            String filename = "";
            String charset = "utf-8";
            StringBuilder buffer = null;

            try (BufferedReader reader = new BufferedReader(new FileReader(mhtFile))) {
                final String boundary = getBoundary(reader);
                if (boundary == null)
                    throw new Exception("Failed to find document 'boundary'... Aborting");

                String line = null;
                while ((line = reader.readLine()) != null) {
                    String temp = line.trim();
                    if (temp.contains(boundary)) {
                        // A boundary line closes the previous part (if any) and opens a new one.
                        if (buffer != null) {
                            writeBufferContentToFile(buffer, encoding, filename, charset);
                        }
                        buffer = new StringBuilder();
                    } else if (temp.startsWith(CONTENT_TYPE)) {
                        type = getType(temp);
                    } else if (temp.startsWith(CHAR_SET)) {
                        charset = getCharSet(temp);
                    } else if (temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                        encoding = getEncoding(temp);
                    } else if (temp.startsWith(CONTENT_LOCATION)) {
                        location = temp.substring(temp.indexOf(":") + 1).trim();
                        filename = getFileName(location, type);
                    } else {
                        if (buffer != null) {
                            buffer.append(line + "\n");
                        }
                    }
                }
            }
        }

        private String getCharSet(String temp) {
            String t = temp.split("=")[1].trim();
            return t.substring(1, t.length() - 1);
        }

        /**
         * Save the file as per character set and encoding
         */
        private void writeBufferContentToFile(StringBuilder buffer, String encoding,
                String filename, String charset) throws Exception {
            if (!outputFolder.exists())
                outputFolder.mkdirs();

            byte[] content = null;
            boolean text = true;
            if (encoding.equalsIgnoreCase("base64")) {
                content = getBase64EncodedString(buffer);
                text = false;
            } else if (encoding.equalsIgnoreCase("quoted-printable")) {
                content = getQuotedPrintableString(buffer);
            } else
                content = buffer.toString().getBytes();

            if (!text) {
                // try-with-resources closes the stream even if the write fails
                try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filename))) {
                    bos.write(content);
                    bos.flush();
                }
            } else {
                try (BufferedWriter bw = new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(filename), charset))) {
                    bw.write(new String(content));
                    bw.flush();
                }
            }
        }

        /**
         * When the *.mht file is saved with 'utf-8' encoding, '=EF=BB=BF' (the encoded
         * UTF-8 byte order mark) is added to the content.
         * Only the BOM marker and soft line breaks ("=" at end of line) are removed here;
         * other =XX escape sequences are left undecoded.
         * @see http://en.wikipedia.org/wiki/Byte_order_mark
         */
        private byte[] getQuotedPrintableString(StringBuilder buffer) {
            String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");
            return temp.getBytes();
        }

        /**
         * Decodes base64 content. java.util.Base64 (Java 8+) replaces the original
         * sun.misc.BASE64Decoder, which is not available on current JDKs. The MIME
         * decoder ignores the line breaks inside the encoded block.
         */
        private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
            return Base64.getMimeDecoder().decode(buffer.toString());
        }

        /**
         * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
         * Otherwise it returns 'unknown.'
         */
        private String getFileName(String location, String type) {
            final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
            String ext = "";
            String name = "";
            if (type.toLowerCase().endsWith("jpeg"))
                ext = "jpg";
            else
                ext = type.split("/")[1];

            if (location.endsWith("/")) {
                name = "main";
            } else {
                name = location.substring(location.lastIndexOf("/") + 1);
                Matcher m = p.matcher(name);
                String fname = "";
                while (m.find()) {
                    fname = m.group();
                }
                if (fname.trim().length() == 0)
                    name = "unknown";
                else
                    return getUniqueName(fname.substring(0, fname.indexOf(".")),
                            fname.substring(fname.indexOf(".") + 1, fname.length()));
            }
            return getUniqueName(name, ext);
        }

        /**
         * Returns a qualified unique output file path for the parsed path.
         * In case the file already exists it appends a numerical value and continues
         */
        private String getUniqueName(String name, String ext) {
            int i = 1;
            File file = new File(outputFolder, name + "." + ext);
            if (file.exists()) {
                while (true) {
                    file = new File(outputFolder, name + i + "." + ext);
                    if (!file.exists())
                        return file.getAbsolutePath();
                    i++;
                }
            }
            return file.getAbsolutePath();
        }

        private String getType(String line) {
            return splitUsingColonSpace(line);
        }

        private String getEncoding(String line) {
            return splitUsingColonSpace(line);
        }

        private String splitUsingColonSpace(String line) {
            return line.split(":\\s*")[1].replaceAll(";", "");
        }

        /**
         * Gives you the boundary string
         */
        private String getBoundary(BufferedReader reader) throws Exception {
            String line = null;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.startsWith(BOUNDARY)) {
                    return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
                }
            }
            return null;
        }
    }
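
The parser's getQuotedPrintableString only removes the BOM marker and soft line breaks, so =XX escapes remain in the output. A fuller decoder might look like the sketch below. It is an illustration under the assumption that the input is ordinary quoted-printable text (ASCII characters, =XX hex escapes, and "=" at end of line as a soft break); the class name QuotedPrintable is hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class QuotedPrintable {
    /** Decodes quoted-printable text into raw bytes. */
    static byte[] decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c != '=') {
                out.write(c); // encoded text is expected to be ASCII; only the low byte is kept
                i++;
            } else if (encoded.startsWith("=\r\n", i)) {
                i += 3; // soft line break (CRLF): drop it
            } else if (encoded.startsWith("=\n", i)) {
                i += 2; // soft line break (LF only): drop it
            } else {
                int hi = i + 2 < encoded.length() ? Character.digit(encoded.charAt(i + 1), 16) : -1;
                int lo = i + 2 < encoded.length() ? Character.digit(encoded.charAt(i + 2), 16) : -1;
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // =XX becomes the byte 0xXX
                    i += 3;
                } else {
                    out.write('='); // malformed escape: keep the '=' literally
                    i++;
                }
            }
        }
        return out.toByteArray();
    }
}
```

The decoded bytes can then be turned into text with the part's declared charset, for example new String(QuotedPrintable.decode(body), "utf-8"). Note that a leading =EF=BB=BF decodes to the UTF-8 byte order mark, which may still need to be skipped.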

Regards,

You don't have to do it yourself.

Dependency

    <dependency>
        <groupId>org.apache.james</groupId>
        <artifactId>apache-mime4j</artifactId>
        <version>0.7.2</version>
    </dependency>

Run it on your mht file

    public static void main(String[] args) {
        MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
    }

MessageTree

    /**
     * Displays a parsed Message in a window. The window will be divided into
     * two panels. The left panel displays the Message tree. Clicking on a
     * node in the tree shows information on that node in the right panel.
     *
     * Some of this code have been copied from the Java tutorial's JTree section.
     */

Then you can look into it.

😉

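You can try http://www.chilkatsoft.com/mht-features.asp, which can pack/unpack mht files so that you can handle the contents like normal files. The download link is: http://www.chilkatsoft.com/java.asp

I used http://jtidy.sourceforge.net to parse/read/index mht files (but as normal files, not compressed files).

Late to the party, but expanding on @wener's answer for anyone else stumbling across this.

The Apache Mime4J library seems to have the most accessible solution for processing EML or MHTML, far easier than rolling your own!

My prototype 'parseMhtToFile' function below rips HTML files and other artifacts out of a Cognos Active Report 'mht' file, but it could be adapted to other purposes.

It is written in Groovy and requires the Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).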

    import org.apache.james.mime4j.dom.Message
    import org.apache.james.mime4j.dom.Multipart
    import org.apache.james.mime4j.dom.field.ContentTypeField
    import org.apache.james.mime4j.message.DefaultMessageBuilder
    import org.apache.james.mime4j.stream.MimeConfig

    /**
     * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
     * separate html files.
     * Files will be written to outDir (or parent) as baseName + partIdx + ext.
     */
    void parseMhtToFile(File mhtFile, File outDir = null) {
        if (!outDir) { outDir = mhtFile.parentFile }
        // File baseName will be used in generating new filenames
        def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^\.]+$/, '')

        // -- Set up Mime parser, using Default Message Builder
        MimeConfig parserConfig = new MimeConfig();
        parserConfig.setMaxHeaderLen(-1); // The default is a mere 10k
        parserConfig.setMaxLineLen(-1); // The default is only 1000 characters.
        parserConfig.setMaxHeaderCount(-1); // Disable the check for header count.
        DefaultMessageBuilder builder = new DefaultMessageBuilder();
        builder.setMimeEntityConfig(parserConfig);

        // -- Parse the MHT stream data into a Message object
        println "Parsing ${mhtFile}...";
        InputStream mhtStream = mhtFile.newInputStream()
        Message message = builder.parseMessage(mhtStream);

        // -- Process the resulting body parts, writing to file
        assert message.getBody() instanceof Multipart
        Multipart multipart = (Multipart) message.getBody();
        def parts = multipart.getBodyParts();
        parts.eachWithIndex { p, i ->
            ContentTypeField cType = p.header.getField('content-type')
            println "${p.class.simpleName}\t${i}\t${cType.mimeType}"

            // Assume mime sub-type is a "good enough" file-name extension
            // eg text/html = html, image/png = png, application/json = json
            String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
            File partFile = new File(outDir, partFileName)

            // Write part body stream to file
            println "Writing ${partFile}...";
            if (partFile.exists()) partFile.delete();
            InputStream partStream = p.body.inputStream;
            partFile.append(partStream);
        }
    }

Usage is simple:

    File mhtFile = new File('', 'Report-en-au.mht')
    parseMhtToFile(mhtFile)
    println 'Done.'

The output is:

    Parsing \Report-en-au.mht...
    BodyPart 0 text/html
    Writing \Report-en-au_0.html...
    BodyPart 1 image/png
    Writing \Report-en-au_1.png...
    Done.

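Thoughts on other improvements:

  • For 'text' MIME parts, you can access a Reader instead of a Stream, which may be more appropriate for text mining as the OP requested.

  • For generated file extensions, I would use another library to look up the appropriate extension rather than assume the MIME sub-type is adequate.

  • Handle single-body (non-Multipart) and recursive Multipart mhtml files and other complexities. These may require a MimeStreamParser with a custom ContentHandler implementation.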
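
As a rough illustration of the file-extension point, a small lookup table with a cautious fallback could stand in until a proper library is chosen. The sketch below is in Java rather than Groovy; the class MimeExtensions and its table entries are assumptions made for the example, not an exhaustive or authoritative mapping.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class MimeExtensions {
    // Hand-picked cases where the MIME sub-type is not a good file extension.
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("text/plain", "txt");
        KNOWN.put("text/javascript", "js");
        KNOWN.put("application/javascript", "js");
        KNOWN.put("application/x-javascript", "js");
        KNOWN.put("image/jpeg", "jpg");
        KNOWN.put("image/svg+xml", "svg");
        KNOWN.put("image/x-icon", "ico");
    }

    /** Picks a file extension for a Content-Type value, falling back to "bin". */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return "bin";
        }
        // Drop parameters such as "; charset=utf-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        String known = KNOWN.get(mime);
        if (known != null) {
            return known;
        }
        int slash = mime.indexOf('/');
        String subType = slash >= 0 ? mime.substring(slash + 1) : "";
        // Use the sub-type only when it is a plain token such as "html" or "png".
        return subType.matches("[a-z0-9]+") ? subType : "bin";
    }
}
```

In the Groovy function above, the value passed in would correspond to cType.mimeType, and the result would replace cType.subType when building the part's file name.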