How to read or parse MHTML (.mht) files in Java

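I need to mine the content of most well-known document file types, such as:

PDF
HTML
DOC/DOCX, etc.

For most of these file formats I plan to use:

http://tika.apache.org/

However, as of now Tika does not support MHTML (*.mht) files (http://en.wikipedia.org/wiki/MHTML). There are a few examples in C# (http://www.codeproject.com/KB/files/MhtBuilder.aspx), but I could not find any in Java.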

I tried opening the *.mht file in 7-Zip, but it failed. WinZip, however, was able to extract the file into images and text (CSS, HTML, script) as text and binary files.

According to the MSDN page (http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content) and the Code Project page mentioned above, MHT files use GZip compression.

Attempting to decompress the file in Java results in the following exceptions. With java.util.zip.GZIPInputStream:

    java.io.IOException: Not in GZIP format
        at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

    java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(Unknown Source)
        at java.util.zip.ZipFile.<init>(Unknown Source)
        at GZipTest.main(GZipTest.java:21)

Please suggest how to decompress it.

Thanks.
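One quick way to see why both attempts fail is to look at the first bytes of the file. The sketch below is illustrative only: `MagicSniffer` and the labels it returns are hypothetical names, not part of any library. It checks for the GZIP magic bytes (0x1F 0x8B) and the ZIP signature ("PK"), and otherwise looks for plain-text MIME headers, assuming the file starts the way a browser-saved .mht file typically does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses the container type from the first bytes of a file. */
final class MagicSniffer {

    static String sniff(byte[] head) {
        // GZIP streams start with the two bytes 0x1F 0x8B
        if (head.length >= 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B) {
            return "gzip";
        }
        // ZIP archives start with the ASCII letters "PK"
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "zip";
        }
        // Assumption: a MIME document opens with readable header lines
        String text = new String(head, StandardCharsets.US_ASCII);
        if (text.startsWith("From:") || text.startsWith("MIME-Version:")
                || text.contains("Content-Type:")) {
            return "mime-text";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java MagicSniffer <file>");
            return;
        }
        byte[] all = Files.readAllBytes(Paths.get(args[0]));
        // Only the first few hundred bytes are needed for the guess
        System.out.println(sniff(Arrays.copyOf(all, Math.min(all.length, 512))));
    }
}
```

If the result is `mime-text`, neither `GZIPInputStream` nor `ZipFile` applies, and the file has to be handled as a MIME multipart message instead.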

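Frankly, I was not expecting to find a solution in the near future and was about to give up, but then I stumbled upon these pages:

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

They do not look very appealing at first glance, but if you read them carefully you will get the clue. After reading them, I fired up IE and started saving random pages as *.mht files. Let me go through one line by line.

But first, let me explain that my ultimate goal was to separate/extract the HTML content and parse it. The solution is not complete in itself, because it depends on the character set or encoding I chose when saving. Even so, it extracts the individual files, with minor hitches.

I hope this will be useful for anyone trying to parse/decompress *.mht/MHTML files :)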

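======= Explanation ======== **Taken from an MHT file**

    From: "Saved by Windows Internet Explorer 7"

This is the software that was used to save the file.

    Subject: Google
    Date: Tue, 13 Jul 2010 21:23:03 +0530
    MIME-Version: 1.0

Subject, date and MIME version, much like the mail format.

    Content-Type: multipart/related; type="text/html";

This is the part that tells us it is a multipart document. A multipart document combines one or more different sets of data in a single body, and a multipart Content-Type field must appear in the entity's header. Here we can also see that the type is "text/html".

    boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"

This is the most important part of all. It is the unique delimiter that separates the different parts (HTML, images, CSS, script, etc.). Once you get hold of it, everything becomes easy. Now I just have to iterate through the document, find the different sections, and save them according to their Content-Transfer-Encoding (base64, quoted-printable, etc.).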
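The boundary handling described above can be sketched in a few lines of plain Java. This is a minimal illustration, not the parser from this answer: `MhtSplitter`, `findBoundary` and `splitParts` are hypothetical names, the boundary value is assumed to be quoted as in the header shown above, and each delimiter line in the body is assumed to be the boundary prefixed with `--` (with a trailing `--` on the final one), as in standard MIME multipart messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds the boundary and splits an MHT document into raw parts. */
final class MhtSplitter {

    // Assumption: the boundary value is enclosed in double quotes
    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the boundary parameter value, or null if none is found. */
    static String findBoundary(String mhtText) {
        Matcher m = BOUNDARY.matcher(mhtText);
        return m.find() ? m.group(1) : null;
    }

    /** Returns each part (its headers plus body) as one string. */
    static List<String> splitParts(String mhtText) {
        List<String> parts = new ArrayList<>();
        String boundary = findBoundary(mhtText);
        if (boundary == null) {
            return parts;
        }
        String delimiter = "--" + boundary;
        StringBuilder current = null; // stays null until the first delimiter is seen
        for (String line : mhtText.split("\r?\n", -1)) {
            if (line.startsWith(delimiter)) {
                if (current != null) {
                    parts.add(current.toString());
                }
                if (line.startsWith(delimiter + "--")) { // closing delimiter
                    current = null;
                    break;
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) { // document ended without a closing delimiter
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        String doc = "Content-Type: multipart/related;\r\n\tboundary=\"----=_Demo\"\r\n\r\n"
                + "------=_Demo\r\nContent-Type: text/plain\r\n\r\nhello\r\n"
                + "------=_Demo--\r\n";
        System.out.println(splitParts(doc).size() + " part(s) found");
    }
}
```

Each returned string still contains the part's own headers, so the Content-Transfer-Encoding has to be read from it before the body can be decoded.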

Sample

    ------=_NextPart_000_0007_01CB22D1.93BBD1A0
    Content-Type: text/html; charset="utf-8"
    Content-Transfer-Encoding: quoted-printable
    Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8
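As an illustration of reading the per-part headers shown in the sample, here is a small sketch. `PartHeaders` is a hypothetical name; it assumes that the header block ends at the first blank line and that a line starting with a space or tab continues the previous header, which is how folded MIME headers are written.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: parses the header block of one MHT part. */
final class PartHeaders {

    /** Returns lower-cased header names mapped to their unfolded values. */
    static Map<String, String> parse(String rawPart) {
        Map<String, String> headers = new LinkedHashMap<>();
        String lastName = null;
        for (String line : rawPart.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line separates the headers from the body
            }
            boolean folded = line.startsWith(" ") || line.startsWith("\t");
            if (folded && lastName != null) {
                // Continuation line: append it to the previous header value
                headers.put(lastName, headers.get(lastName) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a header line; ignored in this sketch
            }
            lastName = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(lastName, line.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        String part = "Content-Type: text/html;\r\n\tcharset=\"utf-8\"\r\n"
                + "Content-Transfer-Encoding: quoted-printable\r\n\r\nbody";
        System.out.println(parse(part));
    }
}
```

Only the first colon on a line is treated as the separator, so URL values such as the Content-Location above are kept whole.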


**Java code**

Interface for defining the constants.

    public interface IConstants {
        public String BOUNDARY = "boundary";
        public String CHAR_SET = "charset";
        public String CONTENT_TYPE = "Content-Type";
        public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
        public String CONTENT_LOCATION = "Content-Location";
        public String UTF8_BOM = "=EF=BB=BF";
        public String UTF16_BOM1 = "=FF=FE";
        public String UTF16_BOM2 = "=FE=FF";
    }

The main parser class...

    /**
     * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
     * which accompanies this distribution, and is available at
     * http://www.eclipse.org/legal/epl-v10.html
     */
    package com.test.mht.core;

    import java.io.BufferedOutputStream;
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.OutputStreamWriter;
    import java.util.Base64;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * File to parse and decompose a *.mht file into its constituent parts.
     * @author Manish Shukla
     */
    public class MHTParser implements IConstants {

        private File mhtFile;
        private File outputFolder;

        public MHTParser(File mhtFile, File outputFolder) {
            this.mhtFile = mhtFile;
            this.outputFolder = outputFolder;
        }

        /**
         * Walks through the file and writes every part to the output folder.
         * @throws Exception if no boundary is found or a part cannot be written
         */
        public void decompress() throws Exception {
            String type = "";
            String encoding = "";
            String location = "";
            String filename = "";
            String charset = "utf-8";
            StringBuilder buffer = null;

            try (BufferedReader reader = new BufferedReader(new FileReader(mhtFile))) {
                final String boundary = getBoundary(reader);
                if (boundary == null)
                    throw new Exception("Failed to find document 'boundary'... Aborting");

                String line = null;
                while ((line = reader.readLine()) != null) {
                    String temp = line.trim();
                    if (temp.contains(boundary)) {
                        // A boundary line closes the previous part (if any) and starts a new one
                        if (buffer != null) {
                            writeBufferContentToFile(buffer, encoding, filename, charset);
                        }
                        buffer = new StringBuilder();
                    } else if (temp.startsWith(CONTENT_TYPE)) {
                        type = getType(temp);
                    } else if (temp.startsWith(CHAR_SET)) {
                        charset = getCharSet(temp);
                    } else if (temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                        encoding = getEncoding(temp);
                    } else if (temp.startsWith(CONTENT_LOCATION)) {
                        location = temp.substring(temp.indexOf(":") + 1).trim();
                        filename = getFileName(location, type);
                    } else {
                        if (buffer != null) {
                            buffer.append(line + "\n");
                        }
                    }
                }
            }
        }

        /** Extracts the quoted value from a line such as charset="utf-8". */
        private String getCharSet(String temp) {
            String t = temp.split("=")[1].trim();
            return t.substring(1, t.length() - 1);
        }

        /**
         * Save the file as per character set and encoding
         */
        private void writeBufferContentToFile(StringBuilder buffer, String encoding,
                String filename, String charset) throws Exception {
            if (!outputFolder.exists())
                outputFolder.mkdirs();

            byte[] content = null;
            boolean text = true;
            if (encoding.equalsIgnoreCase("base64")) {
                content = getBase64EncodedString(buffer);
                text = false;
            } else if (encoding.equalsIgnoreCase("quoted-printable")) {
                content = getQuotedPrintableString(buffer);
            } else {
                content = buffer.toString().getBytes();
            }

            if (!text) {
                // Binary parts (images etc.) are written byte for byte
                try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filename))) {
                    bos.write(content);
                }
            } else {
                // Text parts are written using the part's character set
                try (BufferedWriter bw = new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(filename), charset))) {
                    bw.write(new String(content));
                }
            }
        }

        /**
         * When the *.mht file is saved with 'utf-8' encoding, '=EF=BB=BF' is prepended.
         * Only that BOM and the soft line breaks ("=" at the end of a line) are removed here;
         * other =XX escapes are left undecoded.
         * @see http://en.wikipedia.org/wiki/Byte_order_mark
         */
        private byte[] getQuotedPrintableString(StringBuilder buffer) {
            String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");
            return temp.getBytes();
        }

        /**
         * Decodes the base64 body. java.util.Base64 (Java 8+) is used because the JDK-internal
         * sun.misc.BASE64Decoder is not available on current JDKs; the MIME decoder ignores line breaks.
         */
        private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
            return Base64.getMimeDecoder().decode(buffer.toString());
        }

        /**
         * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
         * Otherwise it returns 'unknown.'
         */
        private String getFileName(String location, String type) {
            final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
            String ext = "";
            String name = "";
            if (type.toLowerCase().endsWith("jpeg"))
                ext = "jpg";
            else
                ext = type.split("/")[1];

            if (location.endsWith("/")) {
                name = "main";
            } else {
                name = location.substring(location.lastIndexOf("/") + 1);
                Matcher m = p.matcher(name);
                String fname = "";
                while (m.find()) {
                    fname = m.group();
                }
                if (fname.trim().length() == 0)
                    name = "unknown";
                else
                    return getUniqueName(fname.substring(0, fname.indexOf(".")),
                            fname.substring(fname.indexOf(".") + 1, fname.length()));
            }
            return getUniqueName(name, ext);
        }

        /**
         * Returns a qualified unique output file path for the parsed path.
         * In case the file already exists it appends a numerical value and continues
         */
        private String getUniqueName(String name, String ext) {
            int i = 1;
            File file = new File(outputFolder, name + "." + ext);
            if (file.exists()) {
                while (true) {
                    file = new File(outputFolder, name + i + "." + ext);
                    if (!file.exists())
                        return file.getAbsolutePath();
                    i++;
                }
            }
            return file.getAbsolutePath();
        }

        private String getType(String line) {
            return splitUsingColonSpace(line);
        }

        private String getEncoding(String line) {
            return splitUsingColonSpace(line);
        }

        private String splitUsingColonSpace(String line) {
            return line.split(":\\s*")[1].replaceAll(";", "");
        }

        /**
         * Gives you the boundary string
         */
        private String getBoundary(BufferedReader reader) throws Exception {
            String line = null;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.startsWith(BOUNDARY)) {
                    return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
                }
            }
            return null;
        }
    }

Regards,
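For the two transfer encodings mentioned in this answer, decoding can be sketched with the JDK alone. This is an illustrative sketch rather than part of the parser above: `TransferDecoder` is a hypothetical name, `java.util.Base64` requires Java 8 or later, and the quoted-printable routine is a simplified hand-written decoder that assumes an ASCII-encoded body and copies malformed escapes through unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: decodes base64 and quoted-printable part bodies. */
final class TransferDecoder {

    /** The MIME decoder skips line breaks and other characters outside the base64 alphabet. */
    static byte[] decodeBase64(String body) {
        return Base64.getMimeDecoder().decode(body);
    }

    /** Removes soft line breaks and turns every =XX escape into the byte it stands for. */
    static byte[] decodeQuotedPrintable(String body) {
        byte[] in = body.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            int b = in[i];
            if (b != '=') {
                out.write(b);
                continue;
            }
            // Soft line break: "=" directly followed by LF or CRLF
            if (i + 1 < in.length && in[i + 1] == '\n') {
                i += 1;
                continue;
            }
            if (i + 2 < in.length && in[i + 1] == '\r' && in[i + 2] == '\n') {
                i += 2;
                continue;
            }
            // Hex escape: "=" followed by two hexadecimal digits
            if (i + 2 < in.length) {
                int hi = Character.digit(in[i + 1], 16);
                int lo = Character.digit(in[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(b); // malformed escape: keep the "=" literally
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] decoded = decodeQuotedPrintable("caf=C3=A9 =\r\nau lait");
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```

Because the result is a byte array, the part's charset is applied only once, when the bytes are turned into text. A body saved as UTF-8 may still begin with the byte order mark bytes EF BB BF after decoding.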



You don't have to do it yourself.

Dependency

    <dependency>
        <groupId>org.apache.james</groupId>
        <artifactId>apache-mime4j</artifactId>
        <version>0.7.2</version>
    </dependency>

Run it on your file

    public static void main(String[] args) {
        MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
    }

MessageTree will:

    /**
     * Displays a parsed Message in a window. The window will be divided into
     * two panels. The left panel displays the Message tree. Clicking on a
     * node in the tree shows information on that node in the right panel.
     *
     * Some of this code have been copied from the Java tutorial's JTree section.
     */

Then you can look into it.

😉



You can try http://www.chilkatsoft.com/mht-features.asp; it can pack and unpack MHT files, and you can then handle the result like normal files. The download link is: http://www.chilkatsoft.com/java.asp



I use http://jtidy.sourceforge.net to parse/read/index MHT files (but as normal files, not as compressed files).



Late to the party, but expanding on @wener's answer for anyone else who stumbles upon this.

The Apache Mime4J library seems to have the most accessible solution for processing EML or MHTML, far easier than rolling your own!

My prototype 'parseMhtToFile' function below rips the HTML files and other artifacts out of a Cognos active report 'mht' file, but it could be adapted for other purposes.

It is written in Groovy and requires the Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).

    import org.apache.james.mime4j.dom.Message
    import org.apache.james.mime4j.dom.Multipart
    import org.apache.james.mime4j.dom.field.ContentTypeField
    import org.apache.james.mime4j.message.DefaultMessageBuilder
    import org.apache.james.mime4j.stream.MimeConfig

    /**
     * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
     * separate html files.
     * Files will be written to outDir (or parent) as baseName + partIdx + ext.
     */
    void parseMhtToFile(File mhtFile, File outDir = null) {
        if (!outDir) { outDir = mhtFile.parentFile }
        // File baseName will be used in generating new filenames
        def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^\.]+$/, '')

        // -- Set up Mime parser, using Default Message Builder
        MimeConfig parserConfig = new MimeConfig();
        parserConfig.setMaxHeaderLen(-1);   // The default is a mere 10k
        parserConfig.setMaxLineLen(-1);     // The default is only 1000 characters.
        parserConfig.setMaxHeaderCount(-1); // Disable the check for header count.
        DefaultMessageBuilder builder = new DefaultMessageBuilder();
        builder.setMimeEntityConfig(parserConfig);

        // -- Parse the MHT stream data into a Message object
        println "Parsing ${mhtFile}...";
        InputStream mhtStream = mhtFile.newInputStream()
        Message message = builder.parseMessage(mhtStream);

        // -- Process the resulting body parts, writing to file
        assert message.getBody() instanceof Multipart
        Multipart multipart = (Multipart) message.getBody();
        def parts = multipart.getBodyParts();
        parts.eachWithIndex { p, i ->
            ContentTypeField cType = p.header.getField('content-type')
            println "${p.class.simpleName}\t${i}\t${cType.mimeType}"

            // Assume mime sub-type is a "good enough" file-name extension
            // eg text/html = html, image/png = png, application/json = json
            String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
            File partFile = new File(outDir, partFileName)

            // Write part body stream to file
            println "Writing ${partFile}...";
            if (partFile.exists()) partFile.delete();
            InputStream partStream = p.body.inputStream;
            partFile.append(partStream);
        }
    }

Usage is simple:

    File mhtFile = new File('', 'Report-en-au.mht')
    parseMhtToFile(mhtFile)
    println 'Done.'

Output is:

    Parsing \Report-en-au.mht...
    BodyPart	0	text/html
    Writing \Report-en-au_0.html...
    BodyPart	1	image/png
    Writing \Report-en-au_1.png...
    Done.

Thoughts on other improvements:

- For 'text' MIME parts, you can access a Reader instead of a Stream, which may be more appropriate for the text mining the OP asked about.

- For the generated file extensions, I would use another library to look up an appropriate extension rather than assuming the MIME sub-type is adequate.

- Handle monolithic (non-Multipart) and recursive Multipart MHTML files, and other complexities. These may require a MimeStreamParser with a custom content handler implementation.
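The first two ideas can be sketched with the JDK alone, independently of Mime4J. Everything here is an assumption made for illustration: `PartNaming` is a hypothetical name, the extension table is a small hand-picked sample rather than a complete registry, and UTF-8 is used as the fallback when a Content-Type value carries no usable charset parameter.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: picks a file extension and a charset for a decoded part. */
final class PartNaming {

    // Assumption: a hand-picked table; extend it or swap in a library lookup as needed
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("text/plain", "txt");
        EXTENSIONS.put("text/javascript", "js");
        EXTENSIONS.put("application/javascript", "js");
        EXTENSIONS.put("application/x-javascript", "js");
        EXTENSIONS.put("image/jpeg", "jpg");
        EXTENSIONS.put("image/svg+xml", "svg");
        EXTENSIONS.put("image/x-icon", "ico");
    }

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Looks the type up in the table, then falls back to the MIME sub-type. */
    static String extensionFor(String mimeType) {
        String key = mimeType.trim().toLowerCase(Locale.ROOT);
        String known = EXTENSIONS.get(key);
        if (known != null) {
            return known;
        }
        int slash = key.indexOf('/');
        return (slash >= 0 && slash < key.length() - 1) ? key.substring(slash + 1) : "bin";
    }

    /** Reads the charset parameter of a Content-Type value, defaulting to UTF-8. */
    static Charset charsetOf(String contentType) {
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: fall through to the default
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Wraps already-decoded bytes of a text part in a Reader that uses the part's charset. */
    static Reader readerFor(byte[] decodedBody, String contentType) {
        return new InputStreamReader(new ByteArrayInputStream(decodedBody), charsetOf(contentType));
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("image/jpeg") + " " + charsetOf("text/html; charset=\"utf-8\""));
    }
}
```

A table like this only covers the types it lists, which is why a dedicated MIME type library remains the more robust option for arbitrary input.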


