How do I parse a line-based text file (.mht) the Scala way?

I want to use Scala to parse .mht files, but I found that my code looks exactly like Java.

Here is a sample mht file:

    From:
    Subject: Tencent IM Message
    MIME-Version: 1.0
    Content-Type:multipart/related;
        charset="utf-8"
        type="text/html";
        boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"

    ------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
    Content-Type: text/html
    Content-Transfer-Encoding:7bit

    ...

    ------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
    Content-Type:image/jpeg
    Content-Transfer-Encoding:base64
    Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat

    /9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
    FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
    FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
    BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
    V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi

    ------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
    Content-Type:image/jpeg
    Content-Transfer-Encoding:base64
    Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat

    /9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
    FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
    FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
    BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
    V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi

    ------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
    Content-Type:image/jpeg
    Content-Transfer-Encoding:base64
    Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat

    /9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
    FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
    FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
    BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
    V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi

    ------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19

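There is a special line called the boundary, which acts as a separator:

    ------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19

The first part is some information about the file, and can be ignored. After it come 4 blocks: the first is an HTML file, and the others are JPEG images as base64-encoded text.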

If I use Java, the code looks like this:

    import java.io.*;
    import java.util.*;

    // substringBetween is a helper not shown here
    // (presumably something like StringUtils.substringBetween from Apache Commons Lang)
    BufferedReader reader = new BufferedReader(new FileReader(new File("test.mht")));
    String line = null;
    String boundary = null;
    // for a block
    String contentType = null;
    String encoding = null;
    String location = null;
    List<String> data = null;
    while ((line = reader.readLine()) != null) {
        // first, get the boundary
        if (boundary == null) {
            if (line.trim().startsWith("boundary=\"")) {
                boundary = substringBetween(line, "\"", "\"");
            }
            continue;
        }
        if (line.equals("--" + boundary)) {
            // new block
            if (contentType != null) {
                // save data to a file
            }
            encoding = null;
            contentType = null;
            location = null;
            data = new ArrayList<String>();
        } else {
            if (encoding == null || contentType == null || location == null) {
                if (line.trim().startsWith("Content-Type:")) { /* get content type */ }
                // else check encoding
                // else check location
            } else {
                data.add(line);
            }
        }
    }

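I tried to rewrite the code in Scala, but I found that the structure is almost the same, except that I use Scala syntax instead of Java.

Is there a Scala way to do the same job?

PS: I don't want to load the whole file into memory, since the file is large. Instead, I want to read and parse it line by line.

Thanks for your help!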

This could be a very simple use case for a state machine.

    import scala.collection.mutable.ListBuffer
    import scala.io.Source

    case class Part(contentType: Option[String], encoding: Option[String],
                    location: Option[String], data: ListBuffer[String])

    var boundary: String = null
    val Boundary = """.*boundary="(.*)"""".r
    var state = 0
    val IN_PART = 1
    val IN_DATA = 2
    var _contentType: Option[String] = None
    var _encoding: Option[String] = None
    var _location: Option[String] = None
    var _data = new ListBuffer[String]()

    Source.fromFile("test.mht").getLines().foreach {
      case Boundary(b) => boundary = b
      // a delimiter line is the boundary value prefixed with two dashes
      case line if boundary != null && line == "--" + boundary =>
        _contentType = None
        _encoding = None
        _location = None
        _data = new ListBuffer[String]()
        state = IN_PART
      case "" => state match {
        case IN_PART => state = IN_DATA
        case IN_DATA =>
          val currentPart = Part(_contentType, _encoding, _location, _data)
          /* deal with current Part as allData.last */
        case _ =>
      }
      case line => state match {
        case IN_DATA => _data.append(line)
        // limit 2: split on the first colon only
        case IN_PART => line.split(":", 2) match {
          case Array("Content-Type", t) => _contentType = Some(t.trim)
          case Array("Content-Transfer-Encoding", e) => _encoding = Some(e.trim)
          case Array("Content-Location", l) => _location = Some(l.trim)
          case _ =>
        }
        case _ => // lines before the first part are ignored
      }
    }
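The mutable variables are not essential to the idea. As a rough sketch (not the code above, and with hypothetical names such as `parseParts`, `State` and `onPart`), the same state machine can be written as a fold over the line iterator, with the state carried in immutable values. It assumes, as in the sample, that each delimiter line starts with `--` followed by the boundary value, and that a blank line separates a part's headers from its data. Only one part is held in memory at a time.

```scala
// Written as top-level definitions (Scala 3, a script, or the REPL).
final case class Part(headers: Map[String, String], data: Vector[String])

sealed trait State
case object Preamble extends State // before the first delimiter line
final case class InHeaders(headers: Map[String, String]) extends State
final case class InData(headers: Map[String, String], data: Vector[String]) extends State

val BoundaryParam = """.*boundary="(.*)"""".r

// Folds over the lines and calls `onPart` each time a part is complete.
def parseParts(lines: Iterator[String])(onPart: Part => Unit): Unit = {
  // hands a finished part (if any) to the callback
  def flush(state: State): Unit = state match {
    case InData(h, d)               => onPart(Part(h, d))
    case InHeaders(h) if h.nonEmpty => onPart(Part(h, Vector.empty))
    case _                          => ()
  }

  // accumulator: (delimiter line once known, current state)
  val start: (Option[String], State) = (None, Preamble)
  val (_, last) = lines.foldLeft(start) {
    case ((None, s), BoundaryParam(b)) => (Some("--" + b), s)
    case ((None, s), _)                => (None, s)
    case ((d @ Some(delim), s), line) if line.startsWith(delim) =>
      flush(s) // a delimiter closes the previous part and opens a new one
      (d, InHeaders(Map.empty))
    case ((d, Preamble), _)      => (d, Preamble)
    case ((d, InHeaders(h)), "") => (d, InData(h, Vector.empty))
    case ((d, InHeaders(h)), line) =>
      line.split(":", 2) match {
        case Array(k, v) => (d, InHeaders(h + (k.trim -> v.trim)))
        case _           => (d, InHeaders(h))
      }
    case ((d, InData(h, data)), line) => (d, InData(h, data :+ line))
  }
  flush(last)
}

// Usage with a real file (getLines() reads lazily):
//   parseParts(scala.io.Source.fromFile("test.mht").getLines())(p => println(p.headers))
```

Blank lines inside a part's data are kept as data lines here; whether that is desirable depends on what you do with each part.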

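I'm going to explain how to build a general solution in a standard way, using parser combinators. The other solutions presented are much faster, but once you understand how to do this, you can easily adapt it to other tasks.

First of all, what you are showing is an e-mail message. The format of these messages is defined in a bunch of RFCs. RFC-822 defines the basics of header and body, though it goes into quite some detail about the headers and says nothing about the body. RFC-1521 and 1522 discuss MIME, and are themselves revisions of RFC 1341 and 1342. There are many more RFCs on the subject.

The interesting thing is that they provide grammars for this stuff, so you can write parsers to break it up correctly. Let's start with a simplified version of RFC822, ignoring almost all known fields and their formats, and simply putting everything in a map. I do this because the grammar is rather long, and the few lines I have here are already comparable to those in the RFC.

With Scala parser combinators, the elements of each rule are separated by ~ (in the RFC, just spaces separate them), and I sometimes use <~ or ~> to discard the uninteresting part. Also, I use ^^ to convert what was parsed into the data structure to be used.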

    import scala.util.parsing.combinator._

    /** Object companion to RFC822, containing the Message class,
      * and extending the trait so that it can be used as a parser
      */
    object RFC822 extends RFC822 {
      case class Message(header: Map[String, String], text: String)
    }

    /**
      * Parses `message` according to RFC-822 (http://www.w3.org/Protocols/rfc822/),
      * but without breaking up the contents for each field,
      * nor identifying particular fields.
      *
      * Also, introduces "header" to convert all fields into a map.
      */
    class RFC822 extends RegexParsers {
      import RFC822.Message

      override def skipWhitespace = false

      def message = (header <~ CRLF) ~ text ^^ {
        case hd ~ txt => Message(hd, txt)
      }

      // this isn't part of the RFC, but we use it to generate a map
      def header = field.* ^^ { _.toMap }

      def field = (fieldName <~ ":") ~ fieldBody <~ CRLF ^^ {
        case name ~ body => name -> body
      }

      def fieldName = """[^:\P{Graph}]+""".r

      // Recursive definition needs a type
      // Also, I use .+ on LWSPChar because it's specified for the lexer,
      // which we are not using
      def fieldBody: Parser[String] = fieldBodyContents ~ (CRLF ~> LWSPChar.+ ~> fieldBody).? ^^ {
        case a ~ Some(b) => a + " " + b // reintroduces a single LWSPChar
        case a ~ None    => a
      }

      def fieldBodyContents = ".*".r

      def CRLF = """\n""".r // this needs to be the regex \n pattern

      def LWSPChar = " " | "\t" // these do not need to be regex

      def text = "(?s).*".r // (?s) makes . match newlines
    }
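To see the combinator operators in isolation, here is a much smaller sketch that parses a single `Name: value` line. The names `HeaderField`, `name`, `value`, `field` and `valueOnly` are made up for this illustration and are not part of the grammar above. Unlike the RFC822 class, it leaves `skipWhitespace` at its default, so the space after the colon is skipped. It assumes the separate `scala-parser-combinators` module is on the classpath (parser combinators have lived outside the core library since Scala 2.11).

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical mini-grammar, for illustration only.
object HeaderField extends RegexParsers {
  def name: Parser[String]  = """[^:\s]+""".r
  def value: Parser[String] = """.*""".r

  // `<~` keeps the left result and drops the ":",
  // `~` sequences two parsers and keeps both results,
  // `^^` turns the parsed pieces into the value we want (a pair here).
  def field: Parser[(String, String)] =
    (name <~ ":") ~ value ^^ { case n ~ v => n -> v }

  // `~>` keeps only the right-hand result, so the name and ":" are discarded.
  def valueOnly: Parser[String] = name ~> ":" ~> value
}

val pair = HeaderField.parseAll(HeaderField.field, "Content-Type: text/html")
val only = HeaderField.parseAll(HeaderField.valueOnly, "Content-Type: text/html")
```

Both calls return a `ParseResult`; on success the parsed value is available through `get` or by matching on `HeaderField.Success`.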

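Now let's handle the content type. The specification from RFC-1521 is implemented below. I put the word type between backticks because it is a reserved word in Scala. Also, I am making a semicolon optional, because the sample you gave is missing one after the charset definition.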

    object ContentType extends ContentType {
      case class Content(`type`: String, subtype: String, parameter: Map[String, String])
    }

    class ContentType extends RegexParsers {
      import ContentType.Content

      // case-insensitive matching of type and subtype
      def content = ("Content-Type" ~> ":" ~> `type` <~ "/") ~ subtype ~ parameters ^^ {
        case t ~ s ~ p => Content(t, s, p)
      }

      // use this to generate a map
      // *** SEMI-COLON IS NOT OPTIONAL ***
      // I'm making it optional because the example is missing one
      def parameters = (";".? ~> parameter).* ^^ (_.toMap)

      // All values case-insensitive
      def `type` = ( "(?i)application".r | "(?i)audio".r
                   | "(?i)image".r       | "(?i)message".r
                   | "(?i)multipart".r   | "(?i)text".r
                   | "(?i)video".r       | extensionToken
                   )

      def extensionToken = xToken | ianaToken

      def ianaToken = failure("IANA token not implemented")

      def xToken = """(?i)x-(?!\s)""".r ~ token ^^ { case a ~ b => a + b }

      def subtype = token

      def parameter = (attribute <~ "=") ~ value ^^ { case a ~ b => a -> b }

      def attribute = token // case-insensitive

      def value = token | quotedString

      def token: Parser[String] = not(tspecials) ~> """\p{Graph}""".r ~ token.? ^^ {
        case a ~ Some(b) => a + b
        case a ~ None    => a
      }

      // Must be in quoted-string,
      // to use within parameter values
      def tspecials = ( "(" | ")" | "<" | ">" | "@"
                      | "," | ";" | ":" | "\\" | "\""
                      | "/" | "[" | "]" | "?" | "="
                      )

      // These are part of RFC822
      def qtext = """[^\\"\n]""".r
      def quotedPair = """\\.""".r
      def quotedString = "\"" ~> (qtext|quotedPair).* <~ "\"" ^^ { _.mkString }
    }
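If all you need from the header is the boundary value, a much lighter alternative to the grammar above is a regular expression. This is a different technique, not the RFC grammar, and the names `Param` and `parameters` are hypothetical. It tolerates the missing semicolon simply because it looks for `name=value` pairs wherever they occur, and it makes no attempt to handle escaped quotes inside a quoted value.

```scala
// name=value, where the value is either a quoted string or a run of
// characters containing no semicolon, whitespace or quote
val Param = """([A-Za-z0-9_-]+)=(?:"([^"]*)"|([^;\s"]+))""".r

// Collects the parameters of a Content-Type value into a map (names lowercased).
def parameters(headerValue: String): Map[String, String] =
  Param.findAllMatchIn(headerValue).map { m =>
    // group 2 is set for quoted values, group 3 for bare ones
    val value = Option(m.group(2)).getOrElse(m.group(3))
    m.group(1).toLowerCase -> value
  }.toMap

val sampleHeader =
  """multipart/related; charset="utf-8" type="text/html"; boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19""""
val boundaryValue = parameters(sampleHeader).get("boundary")
```

Applied to the sample's header, the map should contain `charset`, `type` and `boundary` entries, with the surrounding quotes removed from each value.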

We can now use this to parse the text.

    object Parser {
      def apply(email: String): Option[(Map[String, String], List[String])] = {
        import RFC822._
        parseAll(message, email) match {
          case Success(result, _) =>
            if (result.header.contains("Content-Type")) Some(getParts(result))
            else Some(result.header -> List(result.text))
          case _ => None
        }
      }

      def getParts(message: RFC822.Message): (Map[String, String], List[String]) = {
        import ContentType._
        parseAll(content, "Content-Type: " + message.header("Content-Type")) match {
          case Success(Content("multipart", _, parameters), _) =>
            // The ^.*? part eats the characters before the boundary value on each
            // delimiter line: a delimiter is the boundary prefixed with two dashes.
            // (?m) lets ^ match at the start of every line, not just of the input.
            val parts = message.text split ("(?m)^.*?\\Q" + parameters("boundary") + "\\E")
            val bodies = parts flatMap this.apply flatMap (_._2)
            message.header -> bodies.toList
          case _ => message.header -> List(message.text)
        }
      }
    }

You can then use it like Parser(email).

Again, I don't recommend that you use this solution for the problem at hand! But learning this may well help you in the future.