Jsoup: extracting text

I need to extract text from a node like this:

 Some text with tags might go here. Also there are paragraphs
 More text can go without paragraphs

I need to build:

 Some text with tags might go here. Also there are paragraphs More text can go without paragraphs

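Element.text returns just the entire text content of the div. Element.ownText returns everything that is not inside child elements. Both are wrong for my purpose. Iterating through children ignores text nodes.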
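The difference between these three calls can be reproduced with a short sketch. The sample markup is an assumption (the `<b>`, `<p>` and `<br>` tags are inferred from the node listing in the question), and the class name `TextAccessDemo` is hypothetical; only `Element.text()`, `Element.ownText()`, `Element.children()` and `Node.childNodes()` are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextAccessDemo {
    // Assumed sample markup: the tag names are a guess based on the question.
    static final String HTML = "<div>Some text <b>with tags</b> might go here. "
            + "<p>Also there are paragraphs</p> "
            + "More text can go without paragraphs<br></div>";

    // Parses the sample and returns the <div> element.
    static Element sampleDiv() {
        return Jsoup.parse(HTML).select("div").first();
    }

    public static void main(String[] args) {
        Element div = sampleDiv();
        // text(): combined text of the element and all of its descendants.
        System.out.println("text():     " + div.text());
        // ownText(): only the text held directly by the div, skipping <b> and <p>.
        System.out.println("ownText():  " + div.ownText());
        // children(): child elements only, so text nodes are not counted.
        System.out.println("children:   " + div.children().size());
        // childNodes(): child elements and text nodes, in document order.
        System.out.println("childNodes: " + div.childNodes().size());
    }
}
```

With this markup, `children()` sees only the three elements, while `childNodes()` also exposes the three text nodes between them, which is what the question is after.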

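Is there a way to iterate over an element's contents so that text nodes are received as well? For example:

Text node - Some text
Node - with tags
Text node - might go here.
Node - Also there are paragraphs
Text node - More text can go without paragraphs
Node - (empty)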

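Element.children() returns an Elements object, which is a list of Element objects. Look at the parent class, Node, and you will see methods that give access to arbitrary nodes, not just Elements, such as Node.childNodes().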

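    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;

    public class ListChildNodes {
        public static void main(String[] args) {
            // Sample markup: the tag names (<b>, <p>, <br/>) are assumed
            // from the node listing in the question.
            String str = "<div>"
                    + "    Some text <b>with tags</b> might go here."
                    + "    <p>Also there are paragraphs</p>"
                    + "    More text can go without paragraphs<br/>"
                    + "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();

            // childNodes() returns text nodes as well as elements, in document order.
            int i = 0;
            for (Node node : div.childNodes()) {
                i++;
                System.out.println(String.format("%d %s %s",
                        i, node.getClass().getSimpleName(), node.toString()));
            }
        }
    }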

Result:

    1 TextNode Some text
    2 Element with tags
    3 TextNode might go here.
    4 Element Also there are paragraphs
    5 TextNode More text can go without paragraphs
    6 Element

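    // Visit every element under <body> and print each of its direct text nodes.
    // Assumes doc is an already parsed org.jsoup.nodes.Document.
    for (Element el : doc.select("body").select("*")) {
        for (TextNode node : el.textNodes()) {
            System.out.println(node.text());
        }
    }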

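Assuming you want the text only (no tags), my solution is below.
The output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs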

    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;

    public class StripTagsExample {
        public static void main(String[] args) {
            // Sample markup: the tag names (<b>, <p>, <br/>) are assumed
            // from the node listing in the question.
            String str = "<div>"
                    + "    Some text <b>with tags</b> might go here."
                    + "    <p>Also there are paragraphs.</p>"
                    + "    More text can go without paragraphs<br/>"
                    + "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();

            StringBuilder builder = new StringBuilder();
            stripTags(builder, div.childNodes());
            System.out.println("Text without tags: " + builder.toString());
        }

        /**
         * Strip tags from a List of type Node.
         * @param builder StringBuilder : input and output
         * @param nodesList List of type Node
         */
        public static void stripTags(StringBuilder builder, List<Node> nodesList) {
            for (Node node : nodesList) {
                String nodeName = node.nodeName();
                if (nodeName.equalsIgnoreCase("#text")) {
                    // Text nodes report the node name "#text"; append their content.
                    builder.append(node.toString());
                } else {
                    // Any other node: recurse into its children.
                    stripTags(builder, node.childNodes());
                }
            }
        }
    }

You can use TextNode for this purpose:

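    // Assumes doc is a parsed Document that contains an element with id "content".
    List<TextNode> bodyTextNodes = doc.getElementById("content").textNodes();
    StringBuilder text = new StringBuilder();
    for (TextNode txNode : bodyTextNodes) {
        text.append(txNode.text());
    }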
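One caveat worth illustrating: `Element.textNodes()` returns only the text nodes that are direct children of the element, so text inside nested tags is left out. The sketch below shows this under stated assumptions: the sample markup and its `id="content"` attribute are made up for the example, and `DirectTextNodes` and `directText` are hypothetical names, not part of jsoup.

```java
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class DirectTextNodes {
    // Assumed sample markup; the tag names and the id are illustrative only.
    static final String HTML = "<div id=\"content\">Some text <b>with tags</b> might go here. "
            + "<p>Also there are paragraphs</p> "
            + "More text can go without paragraphs<br></div>";

    // Joins the direct child text nodes of an element with single spaces.
    static String directText(Element el) {
        return el.textNodes().stream()
                .map(TextNode::text)          // whitespace-normalised text of each node
                .map(String::trim)
                .filter(s -> !s.isEmpty())    // drop whitespace-only nodes
                .collect(Collectors.joining(" "));
    }

    // Convenience wrapper: parse the sample and collect the direct text of #content.
    static String sampleDirectText() {
        Element content = Jsoup.parse(HTML).getElementById("content");
        return directText(content);
    }

    public static void main(String[] args) {
        // The text inside <b> and <p> does not appear in the result.
        System.out.println(sampleDirectText());
    }
}
```

If the nested text is needed as well, the recursive approach over `childNodes()` shown earlier in the thread is the better fit.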
