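Node.getTextContent(): is there a way to get the text content of the current node only, not the descendants' text?

Node.getTextContent() returns the text content of the current node and its descendants.

Is there a way to get the text content of the current node only, without the text of its descendants?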

    <paragraph><link>XML</link> is a <strong>browser based XML editor</strong>editor allows users to edit XML data in an intuitive word processor.</paragraph>

Expected output

    paragraph = is a editor allows users to edit XML data in an intuitive word processor.
    link = XML
    strong = browser based XML editor

I tried the code below

    String str = "<paragraph>"
            + "<link>XML</link>"
            + " is a "
            + "<strong>browser based XML editor</strong>"
            + "editor allows users to edit XML data in an intuitive word processor."
            + "</paragraph>";

    org.w3c.dom.Document domDoc = null;
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder;
    try {
        docBuilder = docFactory.newDocumentBuilder();
        ByteArrayInputStream bis = new ByteArrayInputStream(str.getBytes());
        domDoc = docBuilder.parse(bis);
    } catch (ParserConfigurationException e1) {
        e1.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    // Walk every element in document order and print its text content.
    DocumentTraversal traversal = (DocumentTraversal) domDoc;
    NodeIterator iterator = traversal.createNodeIterator(
            domDoc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);

    for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
        String tagname = ((Element) n).getTagName();
        System.out.println(tagname + "=" + ((Element) n).getTextContent());
    }

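But it produces this output

    paragraph=XML is a browser based XML editoreditor allows users to edit XML data in an intuitive word processor.
    link=XML
    strong=browser based XML editor

Note that the paragraph element includes the text of the link and strong elements, which I don't want. Any suggestions?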

What you want is to filter the node's children and keep only those whose node type is Node.TEXT_NODE.

Here is an example of a method that returns the desired content:

    public static String getFirstLevelTextContent(Node node) {
        NodeList list = node.getChildNodes();
        StringBuilder textContent = new StringBuilder();
        for (int i = 0; i < list.getLength(); ++i) {
            Node child = list.item(i);
            // Keep direct text children only; skip elements, comments, etc.
            if (child.getNodeType() == Node.TEXT_NODE) {
                textContent.append(child.getTextContent());
            }
        }
        return textContent.toString();
    }

Applied to your example, that gives:

    String str = "<paragraph>"
            + "<link>XML</link>"
            + " is a "
            + "<strong>browser based XML editor</strong>"
            + "editor allows users to edit XML data in an intuitive word processor."
            + "</paragraph>";

    Document domDoc = null;
    try {
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        ByteArrayInputStream bis = new ByteArrayInputStream(str.getBytes());
        domDoc = docBuilder.parse(bis);
    } catch (Exception e) {
        e.printStackTrace();
    }

    DocumentTraversal traversal = (DocumentTraversal) domDoc;
    NodeIterator iterator = traversal.createNodeIterator(
            domDoc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);

    for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
        String tagname = ((Element) n).getTagName();
        // Print only the element's own text, not its descendants' text.
        System.out.println(tagname + "=" + getFirstLevelTextContent(n));
    }

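Output:

    paragraph= is a editor allows users to edit XML data in an intuitive word processor.
    link=XML
    strong=browser based XML editor

What it does is iterate over all the children of a Node, keep only the text nodes (thus excluding comments, elements, etc.), and accumulate their respective text content.

Neither Node nor Element has a direct method for getting only the first-level text content.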
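One case the TEXT_NODE filter does not cover is CDATA: with a default, non-coalescing DocumentBuilderFactory, a CDATA section is reported as a separate child of type Node.CDATA_SECTION_NODE. The following is a minimal sketch of a variant that accepts both node types. The class name `OwnText` and the helpers `ownText` and `parseRoot` are hypothetical names chosen for illustration; they are not part of the answer above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, used only for illustration.
class OwnText {

    /** Concatenates the direct TEXT_NODE and CDATA_SECTION_NODE children of a node. */
    static String ownText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    /** Parses an XML string and returns its root element; checked exceptions are wrapped. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(
                "<paragraph><link>XML</link> is a <![CDATA[<raw>]]></paragraph>");
        // prints: paragraph= is a <raw>
        System.out.println("paragraph=" + ownText(root));
    }
}
```

If CDATA content should be left out, dropping the second condition brings this back to the plain TEXT_NODE filter.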

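If you change the last for loop to the following one, it behaves as you want:

    for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
        String tagname = ((Element) n).getTagName();
        StringBuilder content = new StringBuilder();
        NodeList children = n.getChildNodes();
        // Assumed completion: the source is truncated after "for(int i=0; i".
        // The body below applies the same direct-text-child filter.
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                content.append(child.getTextContent());
            }
        }
        System.out.println(tagname + "=" + content);
    }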

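I do this with Java 8 streams and a helper class:

    import java.util.*;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class NodeLists {
        /** Converts a NodeList to a java.util.List of Node. */
        static List<Node> list(NodeList nodeList) {
            List<Node> list = new ArrayList<>();
            // Assumed completion: the source is truncated after the loop header.
            for (int i = 0; i < nodeList.getLength(); i++) {
                list.add(nodeList.item(i));
            }
            return list;
        }
    }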

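Then:

    // list() takes a NodeList, so pass the node's children.
    String text = NodeLists.list(node.getChildNodes())
            .stream()
            .filter(child -> child.getNodeType() == Node.TEXT_NODE)
            .map(Node::getTextContent)
            .reduce("", (s, t) -> s + t);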
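The same stream pipeline can also be written without copying the NodeList into a List first, by streaming over the child indices. This is only a sketch of that alternative: the class name `StreamOwnText` and the helpers `ownText` and `parseRoot` are hypothetical, and `Collectors.joining()` is swapped in for the `reduce` call purely as a matter of taste.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, used only for illustration.
class StreamOwnText {

    /** Joins the values of the node's direct text children, in document order. */
    static String ownText(Node node) {
        NodeList children = node.getChildNodes();
        return IntStream.range(0, children.getLength())
                .mapToObj(children::item)                      // index -> child Node
                .filter(child -> child.getNodeType() == Node.TEXT_NODE)
                .map(Node::getNodeValue)
                .collect(Collectors.joining());
    }

    /** Parses an XML string and returns its root element; checked exceptions are wrapped. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<paragraph><link>XML</link> is a <strong>editor</strong>.</paragraph>");
        // prints: paragraph= is a .
        System.out.println("paragraph=" + ownText(root));
    }
}
```

The index-based stream keeps everything in one method, at the cost of being slightly less readable than a plain for loop.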

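There is no built-in function that returns only the node's own text, but a simple trick can do it. Check whether node.getTextContent() contains "\n"; if it does, the node itself has no text of its own. Note that this relies on the XML being formatted with child elements on their own lines.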
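The newline heuristic depends on how the XML happens to be formatted, so a more direct check may be easier to reason about: look at the node's direct text children and see whether any of them contains non-whitespace characters. The sketch below illustrates that idea; the class name `OwnTextCheck` and the helpers `hasOwnText` and `parseRoot` are hypothetical and do not come from the answer above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, used only for illustration.
class OwnTextCheck {

    /** Returns true if any direct text child contains non-whitespace characters. */
    static boolean hasOwnText(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /** Parses an XML string and returns its root element; checked exceptions are wrapped. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Only indentation whitespace sits directly under <a>, so this prints false.
        System.out.println(hasOwnText(parseRoot("<a>\n  <b>text</b>\n</a>")));
        // "hi" is a direct text child of <a>, so this prints true.
        System.out.println(hasOwnText(parseRoot("<a>hi<b>x</b></a>")));
    }
}
```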

Hope this helps.