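A better way to parse XML

I have been parsing XML like this for years, and I have to admit that as the number of different elements grows I find it tedious and exhausting. Here is what I mean, with a sample dummy XML: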

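    <?xml version="1.0"?>
    <Order>
        <Date>2003/07/04</Date>
        <CustomerId>123</CustomerId>
        <CustomerName>Acme Alpha</CustomerName>
        <Item>
            <ItemId>987</ItemId>
            <ItemName>Coupler</ItemName>
            <Quantity>5</Quantity>
        </Item>
        <Item>
            <ItemId>654</ItemId>
            <ItemName>Connector</ItemName>
            <Quantity>3</Quantity>
        </Item>
        <Item>
            <ItemId>579</ItemId>
            <ItemName>Clasp</ItemName>
            <Quantity>1</Quantity>
        </Item>
    </Order>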

Here is the relevant part (using SAX):

    public class SaxParser extends DefaultHandler {

        boolean isItem = false;
        boolean isOrder = false;
        boolean isDate = false;
        boolean isCustomerId = false;
        private Order order;
        private Item item;

        @Override
        public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
            if (localName.equalsIgnoreCase("ORDER")) {
                order = new Order();
            }
            if (localName.equalsIgnoreCase("DATE")) {
                isDate = true;
            }
            if (localName.equalsIgnoreCase("CUSTOMERID")) {
                isCustomerId = true;
            }
            if (localName.equalsIgnoreCase("ITEM")) {
                isItem = true;
            }
        }

        public void characters(char ch[], int start, int length) throws SAXException {
            if (isDate) {
                SimpleDateFormat formatter = new SimpleDateFormat("yyyy/MM/dd");
                String value = new String(ch, start, length);
                try {
                    order.setDate(formatter.parse(value));
                } catch (ParseException e) {
                    e.printStackTrace();
                }
            }
            if (isCustomerId) {
                order.setCustomerId(Integer.valueOf(new String(ch, start, length)));
            }
            if (isItem) {
                item = new Item();
                isItem = false;
            }
        }
    }

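I am wondering whether there is a way to get rid of these ugly booleans, which keep multiplying as the number of elements grows. There must be a better way to parse this relatively simple XML. Just looking at the number of lines of code needed for this task, it looks ugly.

Currently I am using a SAX parser, but I am open to any other suggestions (other than DOM; I cannot afford in-memory parsers because my XML files are huge).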

Here is an example using JAXB together with StAX.

Input file:

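    <?xml version="1.0" encoding="UTF-8"?>
    <Personlist xmlns="http://example.org">
        <Person>
            <Name>Name 1</Name>
            <Address>
                <StreetAddress>Somestreet</StreetAddress>
                <PostalCode>00001</PostalCode>
                <CountryName>Finland</CountryName>
            </Address>
        </Person>
        <Person>
            <Name>Name 2</Name>
            <Address>
                <StreetAddress>Someotherstreet</StreetAddress>
                <PostalCode>43400</PostalCode>
                <CountryName>Sweden</CountryName>
            </Address>
        </Person>
    </Personlist>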

Person.java:

    import javax.xml.bind.annotation.XmlElement;
    import javax.xml.bind.annotation.XmlRootElement;

    @XmlRootElement(name = "Person", namespace = "http://example.org")
    public class Person {

        @XmlElement(name = "Name", namespace = "http://example.org")
        private String name;

        @XmlElement(name = "Address", namespace = "http://example.org")
        private Address address;

        public String getName() {
            return name;
        }

        public Address getAddress() {
            return address;
        }
    }

Address.java:

    import javax.xml.bind.annotation.XmlElement;

    public class Address {

        @XmlElement(name = "StreetAddress", namespace = "http://example.org")
        private String streetAddress;

        @XmlElement(name = "PostalCode", namespace = "http://example.org")
        private String postalCode;

        @XmlElement(name = "CountryName", namespace = "http://example.org")
        private String countryName;

        public String getStreetAddress() {
            return streetAddress;
        }

        public String getPostalCode() {
            return postalCode;
        }

        public String getCountryName() {
            return countryName;
        }
    }

PersonlistProcessor.java:

    import java.io.InputStream;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;

    public class PersonlistProcessor {

        public static void main(String[] args) throws Exception {
            new PersonlistProcessor().processPersonlist(
                    PersonlistProcessor.class.getResourceAsStream("personlist.xml"));
        }

        // TODO: Instead of throws Exception, all exceptions should be wrapped
        // inside runtime exception
        public void processPersonlist(InputStream inputStream) throws Exception {
            JAXBContext jaxbContext = JAXBContext.newInstance(Person.class);
            XMLStreamReader xss = XMLInputFactory.newFactory().createXMLStreamReader(inputStream);

            // Create unmarshaller
            Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

            // Go to next tag
            xss.nextTag();
            // Require Personlist
            xss.require(XMLStreamReader.START_ELEMENT, "http://example.org", "Personlist");

            // Go to next tag
            while (xss.nextTag() == XMLStreamReader.START_ELEMENT) {
                // Require Person
                xss.require(XMLStreamReader.START_ELEMENT, "http://example.org", "Person");
                // Unmarshal one person; only this subtree is held in memory
                Person person = (Person) unmarshaller.unmarshal(xss);
                // Process person
                processPerson(person);
            }

            // Require Personlist
            xss.require(XMLStreamReader.END_ELEMENT, "http://example.org", "Personlist");
        }

        private void processPerson(Person person) {
            System.out.println(person.getName());
            System.out.println(person.getAddress().getCountryName());
        }
    }

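If you control the definition of the XML, you can use an XML binding tool such as JAXB (Java Architecture for XML Binding). In JAXB you can either define a schema for the XML structure (XSD and other schema languages are supported) or annotate your Java classes to define the serialization rules. Once you have a clear declarative mapping between XML and Java, marshalling to and unmarshalling from XML becomes trivial.

Using JAXB does require more memory than SAX handlers, but there are ways to process an XML document in parts: Dealing with large documents.

Oracle's JAXB page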

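I have been using XStream to serialize my own objects to XML and then load them back as Java objects. If you can represent every tag as a POJO and you annotate the POJOs properly to match the types in your XML file, you may find it much easier to use.

When a String holds the XML representation of the object, you can just write: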

Order theOrder = (Order)xstream.fromXML(xmlString);

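I have always used it to load whole objects into memory, but if you need to stream and process as you go, you should be able to use a HierarchicalStreamReader to iterate through the document. This may be very similar to Simple, suggested by @Dave.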

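In SAX the parser "pushes" events at your handler, so you have to do all the housekeeping yourself, as you are doing here. An alternative is StAX (the javax.xml.stream package), which is still streaming, but your code is responsible for "pulling" events from the parser. That way the logic of which elements are expected in what order is encoded in the control flow of your program, rather than having to be represented explicitly in booleans.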
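As a rough illustration of the pull style, here is a minimal StAX sketch for the Order document from the question. The `Order` and `Item` holder classes, their field names, and the `StaxOrderReader` class itself are assumptions made for this example (the question does not show its own classes), and the date is kept as plain text to keep the sketch short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: class and field names are illustrative, not from the question.
final class StaxOrderReader {

    static final class Item {
        int itemId;
        String itemName;
        int quantity;
    }

    static final class Order {
        String date;
        int customerId;
        String customerName;
        final List<Item> items = new ArrayList<>();
    }

    // Convenience entry point; wraps the checked StAX exception.
    static Order parse(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                return readOrder(in);
            } finally {
                in.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed order document", e);
        }
    }

    static Order readOrder(XMLStreamReader in) throws XMLStreamException {
        in.nextTag(); // move from START_DOCUMENT to the root start tag
        in.require(XMLStreamConstants.START_ELEMENT, null, "Order");
        Order order = new Order();
        // The loop position says where we are in the document; no boolean flags needed.
        while (in.nextTag() == XMLStreamConstants.START_ELEMENT) {
            switch (in.getLocalName()) {
                case "Date":
                    order.date = in.getElementText().trim();
                    break;
                case "CustomerId":
                    order.customerId = Integer.parseInt(in.getElementText().trim());
                    break;
                case "CustomerName":
                    order.customerName = in.getElementText().trim();
                    break;
                case "Item":
                    order.items.add(readItem(in));
                    break;
                default:
                    skipElement(in);
            }
        }
        return order;
    }

    // Called with the reader on <Item>; returns with the reader on </Item>.
    private static Item readItem(XMLStreamReader in) throws XMLStreamException {
        Item item = new Item();
        while (in.nextTag() == XMLStreamConstants.START_ELEMENT) {
            switch (in.getLocalName()) {
                case "ItemId":
                    item.itemId = Integer.parseInt(in.getElementText().trim());
                    break;
                case "ItemName":
                    item.itemName = in.getElementText().trim();
                    break;
                case "Quantity":
                    item.quantity = Integer.parseInt(in.getElementText().trim());
                    break;
                default:
                    skipElement(in);
            }
        }
        return item;
    }

    // Skips an unrecognised element, including any nested children.
    private static void skipElement(XMLStreamReader in) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    public static void main(String[] args) {
        Order order = parse("<Order><Date>2003/07/04</Date><CustomerId>123</CustomerId>"
                + "<CustomerName>Acme Alpha</CustomerName>"
                + "<Item><ItemId>987</ItemId><ItemName>Coupler</ItemName><Quantity>5</Quantity></Item>"
                + "</Order>");
        System.out.println(order.customerName + " ordered " + order.items.size() + " item(s)");
    }
}
```

Adding a new element means adding one `case` in the method that handles its parent, so the code grows with the document structure rather than with a set of shared flags.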

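Depending on the precise structure of your XML, there may be a "middle way" using a toolkit such as XOM. It has a mode of operation in which you parse a subtree of the document into a DOM-like object model, process that twig, then throw it away and parse the next one. This is useful for repetitive documents with many similar elements that can each be processed in isolation: you get the convenience of programming against a tree-based API within each twig, but still have the streaming behaviour that lets you parse huge documents efficiently.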

    import nu.xom.Element;
    import nu.xom.NodeFactory;
    import nu.xom.Nodes;

    public class ItemProcessor extends NodeFactory {

        private Nodes emptyNodes = new Nodes();

        @Override
        public Nodes finishMakingElement(Element elt) {
            if ("Item".equals(elt.getLocalName())) {
                // process the Item element here
                System.out.println(elt.getFirstChildElement("ItemId").getValue()
                        + ": " + elt.getFirstChildElement("ItemName").getValue());

                // then throw it away
                return emptyNodes;
            } else {
                return super.finishMakingElement(elt);
            }
        }
    }

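You can use a combination of StAX and JAXB to achieve something similar: define JAXB-annotated classes that represent your repeating element (Item in this example), then create a StAX parser, navigate to the first Item start tag, and unmarshal one complete Item at a time from the XMLStreamReader.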

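As others have suggested, the StAX model is a better approach for minimizing the memory footprint, because it is a streaming, pull-based model. I have personally used Axiom (which is used in Apache Axis2) and parsed elements with XPath expressions, which is less verbose than walking through node elements as is done in the code snippet provided.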

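I have been using this library. It sits on top of the standard Java library and makes things easier for me. In particular, you can ask for a specific element or attribute by name, rather than using the big "if" statement you have described.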

http://marketmovers.blogspot.com/2014/02/the-easy-way-to-read-xml-in-java.html

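There is another library that supports more compact XML parsing: RTXML. The library and its documentation are on rasmustorkel.com. I implemented the parsing of the file from the original question and include the complete program here: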

    package for_so;

    import java.io.File;
    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import rasmus_torkel.xml_basic.read.TagNode;
    import rasmus_torkel.xml_basic.read.XmlReadOptions;
    import rasmus_torkel.xml_basic.read.impl.XmlReader;

    public class Q15626686_ReadOrder {

        public static class Order {
            public final Date _date;
            public final int _customerId;
            public final String _customerName;
            public final ArrayList<Item> _itemAl;

            public Order(TagNode node) {
                _date = (Date)node.nextStringMappedFieldE("Date", Date.class);
                _customerId = (int)node.nextIntFieldE("CustomerId");
                _customerName = node.nextTextFieldE("CustomerName");
                _itemAl = new ArrayList<Item>();
                boolean finished = false;
                while (!finished) {
                    TagNode itemNode = node.nextChildN("Item");
                    if (itemNode != null) {
                        Item item = new Item(itemNode);
                        _itemAl.add(item);
                    } else {
                        finished = true;
                    }
                }
                node.verifyNoMoreChildren();
            }
        }

        public static final Pattern DATE_PATTERN =
                Pattern.compile("^(\\d\\d\\d\\d)\\/(\\d\\d)\\/(\\d\\d)$");

        public static class Date {
            public final String _dateString;
            public final int _year;
            public final int _month;
            public final int _day;

            public Date(String dateString) {
                _dateString = dateString;
                Matcher matcher = DATE_PATTERN.matcher(dateString);
                if (!matcher.matches()) {
                    throw new RuntimeException(dateString + " does not match pattern "
                            + DATE_PATTERN.pattern());
                }
                _year = Integer.parseInt(matcher.group(1));
                _month = Integer.parseInt(matcher.group(2));
                _day = Integer.parseInt(matcher.group(3));
            }
        }

        public static class Item {
            public final int _itemId;
            public final String _itemName;
            public final Quantity _quantity;

            public Item(TagNode node) {
                _itemId = node.nextIntFieldE("ItemId");
                _itemName = node.nextTextFieldE("ItemName");
                _quantity = new Quantity(node.nextChildE("Quantity"));
                node.verifyNoMoreChildren();
            }
        }

        public static class Quantity {
            public final int _unitSize;
            public final int _unitQuantity;

            public Quantity(TagNode node) {
                _unitSize = node.attributeIntD("unit", 1);
                _unitQuantity = node.onlyInt();
            }
        }

        public static void main(String[] args) {
            File xmlFile = new File(args[0]);
            TagNode orderNode = XmlReader.xmlFileToRoot(xmlFile, "Order", XmlReadOptions.DEFAULT);
            Order order = new Order(orderNode);
            System.out.println("Read order for " + order._customerName + " which has "
                    + order._itemAl.size() + " items");
        }
    }

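You will notice that the retrieval functions end in N, E or D. The suffix says what to do when the requested data item is not there: N means return null, E means throw an exception, and D means use a default value.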
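To make the suffix convention concrete, here is a small sketch of the same idea over a plain `Map`. The `FieldLookup` class and its method names are hypothetical and are not part of RTXML; they only mimic the N/E/D naming described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT the RTXML API: it only imitates the N/E/D suffixes.
final class FieldLookup {

    private final Map<String, String> fields;

    FieldLookup(Map<String, String> fields) {
        this.fields = new HashMap<>(fields);
    }

    /** N: return null when the field is absent. */
    Integer intFieldN(String name) {
        String raw = fields.get(name);
        return raw == null ? null : Integer.valueOf(raw.trim());
    }

    /** E: throw an exception when the field is absent. */
    int intFieldE(String name) {
        Integer value = intFieldN(name);
        if (value == null) {
            throw new IllegalStateException("Missing required field: " + name);
        }
        return value;
    }

    /** D: fall back to the supplied default when the field is absent. */
    int intFieldD(String name, int defaultValue) {
        Integer value = intFieldN(name);
        return value == null ? defaultValue : value;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("ItemId", "987");
        FieldLookup lookup = new FieldLookup(raw);
        System.out.println(lookup.intFieldE("ItemId"));
        System.out.println(lookup.intFieldN("Quantity"));
        System.out.println(lookup.intFieldD("unit", 1));
    }
}
```

Encoding the missing-value policy in the method name keeps each call site explicit about whether an element is mandatory, optional, or defaulted.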

A solution that uses no external packages, not even XPath: use an enum "PARSE_MODE", possibly in combination with a Stack<PARSE_MODE>:

1) Basic solution:

a) fields

    private PARSE_MODE parseMode = PARSE_MODE.__UNDEFINED__;

    // NB: essential that all these enum values are upper case, but this is the convention anyway
    private enum PARSE_MODE {
        __UNDEFINED__, ORDER, DATE, CUSTOMERID, ITEM
    };

    private List<String> parseModeStrings = new ArrayList<String>();
    private Stack<PARSE_MODE> modeBreadcrumbs = new Stack<PARSE_MODE>();

b) build your List<String>, perhaps in the constructor:

    for( PARSE_MODE pm : PARSE_MODE.values() ){
        // might want to check here that these are indeed upper case
        parseModeStrings.add( pm.name() );
    }

c) startElement and endElement

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
        String localNameUC = localName.toUpperCase();
        // pushing "__UNDEFINED__" would mess things up! But unlikely name for an XML element
        assert ! localNameUC.equals( "__UNDEFINED__" );

        if( parseModeStrings.contains( localNameUC )){
            parseMode = PARSE_MODE.valueOf( localNameUC );
            // any "policing" to do with which modes are allowed to switch into
            // other modes could be put here...
            // in your case, go `new Order()` here when parseMode == ORDER
            modeBreadcrumbs.push( parseMode );
        }
        else {
            // typically ignore the start of this element...
        }
    }

    // must be public and may only throw SAXException in order to override DefaultHandler.endElement
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String localNameUC = localName.toUpperCase();
        if( parseModeStrings.contains( localNameUC )){
            // pop outside the assert: assertions are disabled by default,
            // and a disabled assert would skip the pop entirely
            PARSE_MODE finishedMode = modeBreadcrumbs.pop();
            // will not fail unless XML structure which is malformed in some way
            // or coding error in use of the Stack, etc.:
            assert finishedMode == parseMode;
            if( modeBreadcrumbs.empty() ){
                parseMode = PARSE_MODE.__UNDEFINED__;
            }
            else {
                parseMode = modeBreadcrumbs.peek();
            }
        }
        else {
            // typically ignore the end of this element...
        }
    }

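... so, what does this all mean? At any time you know the "parse mode" you are in, and you can also look at the Stack<PARSE_MODE> modeBreadcrumbs if you need to find out which other parse modes you passed through to get here.

Your characters method then becomes considerably cleaner: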

    public void characters(char[] ch, int start, int length) throws SAXException {
        switch( parseMode ){
        case DATE:
            // PS - this SimpleDateFormat object can be a field: it doesn't need to be created hundreds of times
            SimpleDateFormat formatter. ...
            String value = ...
            ...
            break;

        case CUSTOMERID:
            order.setCustomerId( ...
            break;

        case ITEM:
            item = new Item();
            // this next line probably won't be needed: when you get to endElement, if
            // parseMode is ITEM, the previous mode will be restored automatically
            // isItem = false ;
        }
    }
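Putting parts a) to c) together, a self-contained variant might look like the sketch below. It deviates from the snippets above in a few labelled ways: it matches on qName (the default SAXParserFactory is not namespace-aware, so localName may be empty), it uses an ArrayDeque instead of Stack, it adds modes for the children of Item, and it buffers text and acts on it in endElement, because SAX is allowed to deliver one text node across several characters() calls. The class name `ModeStackHandler` and the `seen` list, which stands in for real Order and Item setters, are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the enum-plus-stack idea; names are illustrative.
final class ModeStackHandler extends DefaultHandler {

    enum Mode { ORDER, DATE, CUSTOMERID, CUSTOMERNAME, ITEM, ITEMID, ITEMNAME, QUANTITY }

    private final Deque<Mode> breadcrumbs = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    // "MODE=text" pairs in document order, standing in for real setters.
    final List<String> seen = new ArrayList<>();

    // Maps an element name to a mode, or null for elements we do not care about.
    private static Mode modeFor(String qName) {
        try {
            return Mode.valueOf(qName.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Mode mode = modeFor(qName);
        if (mode != null) {
            breadcrumbs.push(mode);
        }
        text.setLength(0); // start collecting text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (modeFor(qName) != null) {
            Mode finished = breadcrumbs.pop();
            switch (finished) {
                case DATE:
                case CUSTOMERID:
                case CUSTOMERNAME:
                case ITEMID:
                case ITEMNAME:
                case QUANTITY:
                    seen.add(finished + "=" + text.toString().trim());
                    break;
                default:
                    // ORDER and ITEM are containers: nothing to record
                    break;
            }
        }
        text.setLength(0);
    }

    // Convenience entry point; wraps the checked parser exceptions.
    static List<String> parse(String xml) {
        ModeStackHandler handler = new ModeStackHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        System.out.println(parse("<Order><Date>2003/07/04</Date><CustomerId>123</CustomerId>"
                + "<Item><ItemId>987</ItemId><ItemName>Coupler</ItemName></Item></Order>"));
    }
}
```

Supporting a new element then means adding an enum constant and a `case`, with no extra boolean field to set and reset.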

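2) A more "professional" solution:
An abstract class which concrete classes must extend, and which therefore cannot tamper with the Stack, etc. Note that this examines qName rather than localName. Thus: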

    public abstract class AbstractSAXHandler extends DefaultHandler {

        protected enum PARSE_MODE implements SAXHandlerParseMode {
            __UNDEFINED__
        };

        // abstract: the concrete subclasses must populate...
        abstract protected Collection<Enum<?>> getPossibleModes();

        private Stack<SAXHandlerParseMode> modeBreadcrumbs = new Stack<SAXHandlerParseMode>();
        private Collection<Enum<?>> possibleModes;
        private Map<String, Enum<?>> nameToEnumMap;

        private Map<String, Enum<?>> getNameToEnumMap(){
            // lazy creation and population of map
            if( nameToEnumMap == null ){
                if( possibleModes == null ){
                    possibleModes = getPossibleModes();
                }
                nameToEnumMap = new HashMap<String, Enum<?>>();
                for( Enum<?> possibleMode : possibleModes ){
                    nameToEnumMap.put( possibleMode.name(), possibleMode );
                }
            }
            return nameToEnumMap;
        }

        protected boolean isLegitimateModeName( String name ){
            return getNameToEnumMap().containsKey( name );
        }

        protected SAXHandlerParseMode getParseMode() {
            return modeBreadcrumbs.isEmpty()? PARSE_MODE.__UNDEFINED__ : modeBreadcrumbs.peek();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes)
                throws SAXException {
            try {
                _startElement(uri, localName, qName, attributes);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        // override in subclasses (NB I think caught Exceptions are not a brilliant design choice in Java)
        protected void _startElement(String uri, String localName, String qName, Attributes attributes)
                throws Exception {
            String qNameUC = qName.toUpperCase();
            // very undesirable ever to push "UNDEFINED"! But unlikely name for an XML element
            assert !qNameUC.equals("__UNDEFINED__") : "Encountered XML element with qName \"__UNDEFINED__\"!";
            if( getNameToEnumMap().containsKey( qNameUC )){
                Enum<?> newMode = getNameToEnumMap().get( qNameUC );
                modeBreadcrumbs.push( (SAXHandlerParseMode)newMode );
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            try {
                _endElement(uri, localName, qName);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        // override in subclasses
        protected void _endElement(String uri, String localName, String qName) throws Exception {
            String qNameUC = qName.toUpperCase();
            if( getNameToEnumMap().containsKey( qNameUC )){
                modeBreadcrumbs.pop();
            }
        }

        public List<SAXHandlerParseMode> showModeBreadcrumbs(){
            return org.apache.commons.collections4.ListUtils.unmodifiableList( modeBreadcrumbs );
        }
    }

    interface SAXHandlerParseMode {
    }

Then, the salient part of a concrete subclass:

    private enum PARSE_MODE implements SAXHandlerParseMode {
        ORDER, DATE, CUSTOMERID, ITEM
    };

    private Collection<Enum<?>> possibleModes;

    @Override
    protected Collection<Enum<?>> getPossibleModes() {
        // lazy initiation
        if (possibleModes == null) {
            List<SAXHandlerParseMode> parseModes =
                    new ArrayList<SAXHandlerParseMode>( Arrays.asList(PARSE_MODE.values()) );
            possibleModes = new ArrayList<Enum<?>>();
            for( SAXHandlerParseMode parseMode : parseModes ){
                possibleModes.add( PARSE_MODE.valueOf( parseMode.toString() ));
            }
            // __UNDEFINED__ mode (from abstract superclass) must be added afterwards
            possibleModes.add( AbstractSAXHandler.PARSE_MODE.__UNDEFINED__ );
        }
        return possibleModes;
    }

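PS this is a starting point for more sophisticated stuff: for example, you might set up a List<Object> which is kept synchronised with the Stack<PARSE_MODE>. The Objects can then be anything you want, letting you "reach back" up the Stack into the ancestor "XML nodes" of the one you are dealing with. Don't use a Map, though: the Stack may contain the same PARSE_MODE object more than once. This in fact illustrates a fundamental characteristic of all tree-like structures: no individual node (here: parse mode) exists in isolation; its identity is always defined by the entire path leading to it.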
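The point about identity being defined by the whole path can be sketched as follows. This is an illustrative variation rather than the handler described above: it keeps a stack of element names instead of enum modes and groups text values by their full path. The class name `PathTrackingHandler` and the sample document, which reuses a hypothetical Name element at two different depths, are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: distinguishes same-named elements by the path that leads to them.
final class PathTrackingHandler extends DefaultHandler {

    // Root-first list of currently open elements (addLast on open, removeLast on close).
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    // Non-empty text values grouped by full element path, in first-seen order.
    final Map<String, List<String>> valuesByPath = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // Build the key before popping so it still includes this element.
            String key = "/" + String.join("/", path);
            valuesByPath.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        path.removeLast();
        text.setLength(0);
    }

    // Convenience entry point; wraps the checked parser exceptions.
    static Map<String, List<String>> parse(String xml) {
        PathTrackingHandler handler = new PathTrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
        return handler.valuesByPath;
    }

    public static void main(String[] args) {
        System.out.println(parse("<Order><Name>Acme Alpha</Name>"
                + "<Item><Name>Coupler</Name></Item>"
                + "<Item><Name>Clasp</Name></Item></Order>"));
    }
}
```

Here the two kinds of Name element end up under different keys even though the element name is identical, which is why a Map keyed on the mode alone would not be enough.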

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.ArrayList;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class JXML {

        private DocumentBuilder builder;
        private Document doc = null;
        private DocumentBuilderFactory factory ;
        private XPathExpression expr = null;
        private XPathFactory xFactory;
        private XPath xpath;
        private String xmlFile;
        public static ArrayList<String> XMLVALUE ;

        public JXML(String xmlFile){
            this.xmlFile = xmlFile;
        }

        private void xmlFileSettings(){
            try {
                factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                xFactory = XPathFactory.newInstance();
                xpath = xFactory.newXPath();
                builder = factory.newDocumentBuilder();
                doc = builder.parse(xmlFile);
            } catch (Exception e){
                System.out.println(e);
            }
        }

        public String[] selectQuery(String query){
            xmlFileSettings();
            ArrayList<String> records = new ArrayList<String>();
            try {
                expr = xpath.compile(query);
                Object result = expr.evaluate(doc, XPathConstants.NODESET);
                NodeList nodes = (NodeList) result;
                for (int i=0; i

}