Jsoup: how to get all the HTML between two header tags
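I want to get all the HTML between two h1 tags. The actual task is to break the HTML into frames (chapters) based on the h1 (heading 1) tags.
Any help is appreciated.
Thanks, Sunil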
If you want to get and process all the elements between two consecutive h1 tags, you can work on the siblings. Here is some sample code:
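    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class H1Sections {

        public static void h1s() {
            // Sample markup: three h1 sections, each followed by a paragraph and a
            // small table (the p/table/tr/td tag names are illustrative).
            String html = "<html>" + "<head></head>" + "<body>"
                    + "<h1>title 1</h1>" + "<p>hello 1</p>"
                    + "<table><tr><td>hello</td><td>world</td><td>1</td></tr></table>"
                    + "<h1>title 2</h1>" + "<p>hello 2</p>"
                    + "<table><tr><td>hello</td><td>world</td><td>2</td></tr></table>"
                    + "<h1>title 3</h1>" + "<p>hello 3</p>"
                    + "<table><tr><td>hello</td><td>world</td><td>3</td></tr></table>"
                    + "</body>" + "</html>";

            Document doc = Jsoup.parse(html);
            Element firstH1 = doc.select("h1").first();
            if (firstH1 == null) {
                return; // no h1 in the document, nothing to split
            }

            List<Element> elementsBetween = new ArrayList<>();
            // Walk forward through the siblings that follow the first h1.
            // Every further h1 closes the group collected so far.
            for (Element sibling = firstH1.nextElementSibling();
                    sibling != null;
                    sibling = sibling.nextElementSibling()) {
                if (!"h1".equals(sibling.tagName())) {
                    elementsBetween.add(sibling);
                } else {
                    processElementsBetween(elementsBetween);
                    elementsBetween.clear();
                }
            }
            // Flush the elements that follow the last h1.
            if (!elementsBetween.isEmpty()) {
                processElementsBetween(elementsBetween);
            }
        }

        private static void processElementsBetween(List<Element> elementsBetween) {
            System.out.println("---");
            for (Element element : elementsBetween) {
                System.out.println(element);
            }
        }

        public static void main(String[] args) {
            h1s();
        }
    }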
I don't know Jsoup that well, but a straightforward approach could look like this:
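    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;

    public class Test {
        public static void main(String[] args) {
            Document document = Jsoup.parse("<html><body>"
                    + "<h1>First</h1><p>text text text</p>"
                    + "<h1>Second</h1>more text"
                    + "</body></html>");

            List<List<Node>> articles = new ArrayList<>();
            List<Node> currentArticle = null;
            for (Node node : document.body().childNodes()) {
                // Start a new article at every h1, and also for any content
                // that appears before the first h1.
                if ("h1".equals(node.nodeName()) || currentArticle == null) {
                    currentArticle = new ArrayList<>();
                    articles.add(currentArticle);
                }
                currentArticle.add(node);
            }

            for (List<Node> article : articles) {
                for (Node node : article) {
                    System.out.println(node);
                }
                System.out.println("------- new page ---------");
            }
        }
    }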
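Since the question asks for the HTML itself rather than a list of nodes, here is a minimal sketch that returns one HTML string per h1. The class name ChapterSplitter and the method chaptersByH1 are made up for this illustration. It assumes the h1 tags are siblings of the content they introduce and that heading texts are unique (a repeated heading would overwrite the earlier entry).

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class ChapterSplitter {

    /** Maps the text of each h1 to the HTML of the nodes that follow it, up to the next h1. */
    static Map<String, String> chaptersByH1(String html) {
        Document doc = Jsoup.parse(html);
        // Turn off pretty-printing so outerHtml() does not add indentation or newlines.
        doc.outputSettings().prettyPrint(false);

        Map<String, String> chapters = new LinkedHashMap<>();
        for (Element h1 : doc.select("h1")) {
            StringBuilder body = new StringBuilder();
            // Walk nodes rather than elements so that loose text is kept as well.
            for (Node n = h1.nextSibling();
                    n != null && !"h1".equals(n.nodeName());
                    n = n.nextSibling()) {
                body.append(n.outerHtml());
            }
            chapters.put(h1.text(), body.toString());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String html = "<body><h1>First</h1><p>one</p>loose text"
                + "<h1>Second</h1><p>two</p></body>";
        chaptersByH1(html).forEach(
                (title, body) -> System.out.println(title + " -> " + body));
    }
}
```

An h1 nested inside another container only sees the siblings within that container, so deeply nested headings would need a different traversal.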
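Do you know the structure of the articles? Is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? That would be a simple jQuery job.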
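Iterating over the elements between consecutive heading elements seems fine, except for one thing: text that does not belong to any tag, such as a bare "this" sitting directly between two heading tags, is not an element and gets skipped. To handle it I implemented a splitElemText function that retrieves this text. First split the whole parent element with this method. Then, in addition to the elements, process the matching entries of the split text. If you want the raw HTML, remove the calls to htmlToText.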
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ElemTextSplitter {

        /**
         * Splits the text of the element {@code elem} by its child tags.
         *
         * @return An array of size {@code c+1}, where {@code c} is the number
         *         of child elements. Text after the {@code n}th element is
         *         found in {@code [n+1]}.
         */
        public static String[] splitElemText(Element elem) {
            int c = elem.children().size();
            String[] as = new String[c + 1];
            String sAll = elem.html();
            int iBeg = 0;
            int iChild = 0;
            for (Element ch : elem.children()) {
                // Locate the child's markup inside the parent's inner HTML.
                String sChild = ch.outerHtml();
                int iEnd = sAll.indexOf(sChild, iBeg);
                if (iEnd < 0) {
                    throw new RuntimeException("Tag " + sChild
                            + " not found in its parent: " + sAll);
                }
                // Everything between the previous child and this one is loose text.
                as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
                iBeg = iEnd + sChild.length();
                iChild += 1;
            }
            // Text after the last child.
            as[iChild] = htmlToText(sAll.substring(iBeg));
            assert (iChild == c);
            return as;
        }

        /** Strips markup from an HTML fragment and returns its plain text. */
        public static String htmlToText(String sHtml) {
            Document doc = Jsoup.parse(sHtml);
            return doc.text();
        }
    }
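The index convention of the returned array is easier to see on plain strings. The sketch below repeats only the splitting step, without jsoup. The name splitAround and the sample string are made up for this illustration, and it assumes the pieces occur in the parent string in the given order.

```java
class GapSplitDemo {

    /**
     * Returns the text found before, between and after the given pieces.
     * The result has pieces.length + 1 entries; the text after the n-th
     * piece (counting from 0) is at index n + 1.
     */
    static String[] splitAround(String all, String... pieces) {
        String[] gaps = new String[pieces.length + 1];
        int begin = 0;
        for (int i = 0; i < pieces.length; i++) {
            int end = all.indexOf(pieces[i], begin);
            if (end < 0) {
                throw new IllegalArgumentException("Piece not found: " + pieces[i]);
            }
            gaps[i] = all.substring(begin, end);
            begin = end + pieces[i].length();
        }
        gaps[pieces.length] = all.substring(begin);
        return gaps;
    }

    public static void main(String[] args) {
        String parent = "intro<h1>A</h1>loose text<h1>B</h1>tail";
        String[] gaps = splitAround(parent, "<h1>A</h1>", "<h1>B</h1>");
        // prints [intro, loose text, tail]
        System.out.println(java.util.Arrays.toString(gaps));
    }
}
```

Here the loose text between the two headings lands at index 1, which is the entry to process together with the elements that follow the first heading.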