Jsoup: how to get all HTML between 2 header tags

I want to get all the HTML between 2 h1 tags. The actual task is to break the HTML into frames (chapters) based on the h1 (heading 1) tags.

Any help is appreciated.

Thanks, Sunil

If you want to get and process all the elements between two consecutive h1 tags, you can work on the siblings. Here is some sample code:

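    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void h1s() {
        // Sample markup: three h1 headings, each followed by sibling elements
        // (the p and div tags used here are illustrative).
        String html = "<html>" + "<head></head>" + "<body>"
            + "  <h1>title 1</h1>"
            + "  <p>hello 1</p>"
            + "  <div>hello world 1</div>"
            + "  <h1>title 2</h1>"
            + "  <p>hello 2</p>"
            + "  <div>hello world 2</div>"
            + "  <h1>title 3</h1>"
            + "  <p>hello 3</p>"
            + "  <div>hello world 3</div>"
            + "</body>" + "</html>";

        Document doc = Jsoup.parse(html);
        Element firstH1 = doc.select("h1").first();
        Elements siblings = firstH1.siblingElements();
        List<Element> elementsBetween = new ArrayList<Element>();

        // siblingElements() does not include firstH1 itself, so start at index 0;
        // starting at 1 would skip the first element after the heading.
        for (int i = 0; i < siblings.size(); i++) {
            Element sibling = siblings.get(i);
            if (!"h1".equals(sibling.tagName())) {
                elementsBetween.add(sibling);
            } else {
                // The next h1 closes the current group.
                processElementsBetween(elementsBetween);
                elementsBetween.clear();
            }
        }
        // Elements after the last h1 form the final group.
        if (!elementsBetween.isEmpty()) {
            processElementsBetween(elementsBetween);
        }
    }

    private static void processElementsBetween(List<Element> elementsBetween) {
        System.out.println("---");
        for (Element element : elementsBetween) {
            System.out.println(element);
        }
    }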

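Since the stated goal is to break the page into chapters, the sibling approach above can also be sketched so that it returns one HTML string per h1. This is only a minimal sketch: the class name ChapterSplitter and the method splitByH1 are hypothetical names chosen for illustration, and it assumes each h1 and its chapter content are siblings under the same parent. Like the code above, it only sees elements, so loose text that sits directly between two headings is not included.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ChapterSplitter {

    /** Returns one HTML string per h1: the h1 itself plus every sibling element up to the next h1. */
    public static List<String> splitByH1(String html) {
        Document doc = Jsoup.parse(html);
        List<String> chapters = new ArrayList<>();
        for (Element h1 : doc.select("h1")) {
            StringBuilder chapter = new StringBuilder(h1.outerHtml());
            // Walk forward through the following siblings until the next h1 (or the end).
            Element next = h1.nextElementSibling();
            while (next != null && !"h1".equals(next.tagName())) {
                chapter.append('\n').append(next.outerHtml());
                next = next.nextElementSibling();
            }
            chapters.add(chapter.toString());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String html = "<h1>title 1</h1><p>hello 1</p><h1>title 2</h1><p>hello 2</p>";
        for (String chapter : splitByH1(html)) {
            System.out.println("--- chapter ---");
            System.out.println(chapter);
        }
    }
}
```

Each returned string could then be written to its own file or frame.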
I don't know Jsoup that well, but a straightforward approach looks like this:

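    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;

    public class Test {
        public static void main(String[] args) {
            Document document = Jsoup.parse("<html><body>"
                + "<h1>First</h1>text text text"
                + "<h1>Second</h1>more text"
                + "</body></html>");

            List<List<Node>> articles = new ArrayList<List<Node>>();
            List<Node> currentArticle = null;

            for (Node node : document.getElementsByTag("body").get(0).childNodes()) {
                // nodeName() is the tag name for elements, so this also matches
                // h1 tags that carry attributes. Content that appears before the
                // first h1 gets its own leading group instead of causing a
                // NullPointerException.
                if ("h1".equals(node.nodeName()) || currentArticle == null) {
                    currentArticle = new ArrayList<Node>();
                    articles.add(currentArticle);
                }
                currentArticle.add(node);
            }

            for (List<Node> article : articles) {
                for (Node node : article) {
                    System.out.println(node);
                }
                System.out.println("------- new page ---------");
            }
        }
    }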

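Do you know the structure of the articles? Is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? That would be a simple jQuery job.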

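Iterating over the elements between consecutive heading elements seems fine, except for one thing: text that does not belong to any tag, like the word this in <h1/>this<h1/>. To solve this problem I implemented the splitElemText function to get that text. First, split the whole parent element using this method. Then, in addition to the elements, process the matching entry from the split text. If you want the raw HTML, remove the calls to htmlToText.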

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /**
     * Splits the text of the element elem by the children tags.
     * @return An array of size c+1, where c is the number of child elements.
     *         Text after nth element is found in [n+1].
     */
    public static String[] splitElemText(Element elem) {
        int c = elem.children().size();
        String[] as = new String[c + 1];
        String sAll = elem.html();
        int iBeg = 0;
        int iChild = 0;
        for (Element ch : elem.children()) {
            String sChild = ch.outerHtml();
            // Locate this child's markup inside the parent's HTML.
            int iEnd = sAll.indexOf(sChild, iBeg);
            if (iEnd < 0) {
                throw new RuntimeException("Tag " + sChild
                    + " not found in its parent: " + sAll);
            }
            // Everything between the previous child and this one is loose text.
            as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
            iBeg = iEnd + sChild.length();
            iChild += 1;
        }
        // Text after the last child.
        as[iChild] = htmlToText(sAll.substring(iBeg));
        assert (iChild == c);
        return as;
    }

    public static String htmlToText(String sHtml) {
        Document doc = Jsoup.parse(sHtml);
        return doc.text();
    }
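The same loose-text problem can also be approached through Jsoup's node API rather than by searching the parent's HTML string. The following is a sketch of that different technique, not the splitElemText method above: the class LooseTextSplitter and the method textBySection are hypothetical names, and it assumes the h1 tags are direct children of the parent element. It walks childNodes(), which includes TextNode instances, and starts a new section at every h1.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LooseTextSplitter {

    /**
     * Collects the text of each section of parent, where every h1 child starts a new section.
     * Index 0 holds whatever precedes the first h1; the h1 text itself is not included.
     */
    public static List<String> textBySection(Element parent) {
        List<StringBuilder> sections = new ArrayList<>();
        sections.add(new StringBuilder());
        for (Node node : parent.childNodes()) {
            StringBuilder current = sections.get(sections.size() - 1);
            if (node instanceof Element && "h1".equals(((Element) node).tagName())) {
                sections.add(new StringBuilder()); // an h1 opens a new section
            } else if (node instanceof TextNode) {
                current.append(((TextNode) node).text()); // loose text outside any tag
            } else if (node instanceof Element) {
                current.append(((Element) node).text()); // text inside a non-h1 element
            }
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder section : sections) {
            result.add(section.toString().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Element body = Jsoup.parse("<h1>a</h1>this<h1>b</h1>that").body();
        for (String text : textBySection(body)) {
            System.out.println("[" + text + "]");
        }
    }
}
```

Because it never compares serialized HTML strings, this variant does not depend on the parent's html() output containing each child's outerHtml() verbatim.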

