Jsoup: how to get all the HTML between 2 header tags

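I want to get all the HTML between 2 h1 tags. The actual task is to break the HTML into frames (chapters) based on the h1 (heading 1) tags.

Any help is appreciated.

Thanks, Sunil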

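If you want to collect and process all the elements between two consecutive h1 tags, you can work on the siblings of the first h1. Here is some sample code: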

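    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class H1Sections {

        public static void h1s() {
            String html = "<html>" +
                "<head></head>" +
                "<body>" +
                "  <h1>title 1</h1>" +
                "  <p>hello 1</p>" +
                "  <table>" +
                "    <tr>" +
                "      <td>hello</td>" +
                "      <td>world</td>" +
                "      <td>1</td>" +
                "    </tr>" +
                "  </table>" +
                "  <h1>title 2</h1>" +
                "  <p>hello 2</p>" +
                "  <table>" +
                "    <tr>" +
                "      <td>hello</td>" +
                "      <td>world</td>" +
                "      <td>2</td>" +
                "    </tr>" +
                "  </table>" +
                "  <h1>title 3</h1>" +
                "  <p>hello 3</p>" +
                "  <table>" +
                "    <tr>" +
                "      <td>hello</td>" +
                "      <td>world</td>" +
                "      <td>3</td>" +
                "    </tr>" +
                "  </table>" +
                "</body>" +
                "</html>";

            Document doc = Jsoup.parse(html);
            Element firstH1 = doc.select("h1").first();
            Elements siblings = firstH1.siblingElements();
            List<Element> elementsBetween = new ArrayList<Element>();

            // Current jsoup versions do not include the element itself in
            // siblingElements(), so start at index 0 to keep the first sibling.
            for (int i = 0; i < siblings.size(); i++) {
                Element sibling = siblings.get(i);
                if (!"h1".equals(sibling.tagName())) {
                    elementsBetween.add(sibling);
                } else {
                    // Reached the next h1: flush what was collected so far.
                    processElementsBetween(elementsBetween);
                    elementsBetween.clear();
                }
            }
            // Elements after the last h1.
            if (!elementsBetween.isEmpty()) {
                processElementsBetween(elementsBetween);
            }
        }

        private static void processElementsBetween(List<Element> elementsBetween) {
            System.out.println("---");
            for (Element element : elementsBetween) {
                System.out.println(element);
            }
        }
    }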

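The same sibling idea can also be written as a walk with `nextElementSibling()`, starting from each h1 and stopping at the next one. This is only a minimal sketch: the class name `SiblingWalk`, the helper `elementsUntilNextH1`, and the sample markup are made up for illustration, and it assumes every h1 shares the same parent.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingWalk {

    // Hypothetical helper: collects the elements that follow the given h1,
    // stopping at the next h1 or at the end of the parent.
    static List<Element> elementsUntilNextH1(Element h1) {
        List<Element> between = new ArrayList<>();
        Element next = h1.nextElementSibling();
        while (next != null && !"h1".equals(next.tagName())) {
            between.add(next);
            next = next.nextElementSibling();
        }
        return between;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch only.
        Document doc = Jsoup.parse(
            "<h1>title 1</h1><p>hello 1</p><div>more 1</div>"
            + "<h1>title 2</h1><p>hello 2</p>");
        for (Element h1 : doc.select("h1")) {
            System.out.println("--- " + h1.text());
            for (Element e : elementsUntilNextH1(h1)) {
                System.out.println(e.outerHtml());
            }
        }
    }
}
```

Like the sibling loop above, this only sees elements, so bare text that sits directly between two h1 tags is not collected.
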
I don't know Jsoup that well, but a straightforward approach could look like this:

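    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;

    public class Test {
        public static void main(String[] args) {
            Document document = Jsoup.parse("<html><head></head><body>" +
                    "<h1>First</h1><p>text text text</p>" +
                    "<h1>Second</h1>more text" +
                    "</body></html>");

            List<List<Node>> articles = new ArrayList<List<Node>>();
            List<Node> currentArticle = null;

            for (Node node : document.getElementsByTag("body").get(0).childNodes()) {
                // Start a new article at every h1 (this matches only h1 tags
                // without attributes). The null check avoids a
                // NullPointerException when content precedes the first h1.
                if (node.outerHtml().startsWith("<h1>") || currentArticle == null) {
                    currentArticle = new ArrayList<Node>();
                    articles.add(currentArticle);
                }
                currentArticle.add(node);
            }

            for (List<Node> article : articles) {
                for (Node node : article) {
                    System.out.println(node);
                }
                System.out.println("------- new page ---------");
            }
        }
    }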

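If the goal is one HTML string per chapter, the node loop above can be adapted to append each node's `outerHtml()` to a buffer and to detect h1 by tag name instead of by string prefix. The following is a sketch under assumptions: `ChapterSplitter` and `splitByH1` are hypothetical names, all h1 tags are assumed to be direct children of body, and content before the first h1 is kept as its own leading chunk.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ChapterSplitter {

    // Hypothetical helper: returns one HTML string per chapter, where each
    // chapter starts at an h1 that is a direct child of body.
    static List<String> splitByH1(String html) {
        Element body = Jsoup.parse(html).body();
        List<String> chapters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Node node : body.childNodes()) {
            // Checking the tag name also matches h1 tags that carry attributes.
            boolean isH1 = node instanceof Element
                    && "h1".equals(((Element) node).tagName());
            if (isH1 && current.length() > 0) {
                chapters.add(current.toString());
                current.setLength(0);
            }
            // childNodes() includes text nodes, so bare text is kept too.
            current.append(node.outerHtml());
        }
        if (current.length() > 0) {
            chapters.add(current.toString());
        }
        return chapters;
    }

    public static void main(String[] args) {
        List<String> chapters = splitByH1(
            "<h1>First</h1><p>text text text</p><h1>Second</h1>more text");
        for (String chapter : chapters) {
            System.out.println(chapter);
            System.out.println("------- new page ---------");
        }
    }
}
```

Building strings rather than lists of nodes makes it easy to write each chapter to its own file, at the cost of losing the parsed nodes.
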
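Do you know the structure of the articles? Is it always the same? What do you want to do with them? Have you considered splitting them on the client side? That would be an easy jQuery job.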

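Iterating over the elements between consecutive h1 elements seems fine, except for one thing: text that does not belong to any tag, such as a bare `this` sitting directly between two h1 tags, is skipped. To solve this, I implemented the splitElemText function to get that text. First, split the whole parent element with this method. Then, in addition to the elements, process the matching entries of the split text. If you want the raw HTML, remove the calls to htmlToText.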

    // Requires org.jsoup.Jsoup, org.jsoup.nodes.Document and
    // org.jsoup.nodes.Element.

    /**
     * Splits the text of the element elem by its child tags.
     * Text after the nth element is found in [n+1].
     *
     * @return An array of size c+1, where c is the number of child elements.
     */
    public static String[] splitElemText(Element elem) {
        int c = elem.children().size();
        String as[] = new String[c + 1];
        String sAll = elem.html();
        int iBeg = 0;
        int iChild = 0;
        for (Element ch : elem.children()) {
            // Relies on each child's outerHtml() appearing verbatim in the
            // parent's html().
            String sChild = ch.outerHtml();
            int iEnd = sAll.indexOf(sChild, iBeg);
            if (iEnd < 0) {
                throw new RuntimeException("Tag " + sChild
                        + " not found in its parent: " + sAll);
            }
            // Text between the previous child and this one.
            as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
            iBeg = iEnd + sChild.length();
            iChild += 1;
        }
        // Text after the last child.
        as[iChild] = htmlToText(sAll.substring(iBeg));
        assert (iChild == c);
        return as;
    }

    public static String htmlToText(String sHtml) {
        Document doc = Jsoup.parse(sHtml);
        return doc.text();
    }
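
The bare-text gap described above comes from the difference between `children()`, which returns only elements, and `childNodes()`, which also returns text nodes. As an alternative sketch (not the splitElemText approach), the direct text of a parent can be read with `textNodes()`. The class name `BareTextDemo`, the helper `bareTexts`, and the sample markup are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class BareTextDemo {

    // Hypothetical helper: returns the text of the text nodes that are
    // direct children of the parent, skipping whitespace-only nodes.
    static List<String> bareTexts(Element parent) {
        List<String> texts = new ArrayList<>();
        for (TextNode t : parent.textNodes()) {
            if (!t.isBlank()) {
                texts.add(t.text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch only.
        Element body = Jsoup.parse("<h1>a</h1>this<h1>b</h1>").body();
        System.out.println(body.children().size());   // elements only
        System.out.println(body.childNodes().size()); // elements and text nodes
        System.out.println(bareTexts(body));
    }
}
```

For this sample, `children()` sees the two h1 elements while `childNodes()` also sees the text node between them, which is the piece an element-only loop would miss.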