HtmlUnitDriver causes problems while getting a URL

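I have a page crawler developed in Java using the Selenium libraries. The crawler goes through a website that launches, via JavaScript, 3 applications, each displayed as HTML in a popup window.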

The crawler has no issues when launching 2 of the applications, but on the 3rd one it freezes forever.

The code I'm using is something like this:

    public void applicationSelect() {
        ...
        // obtain url by parsing the tag's href attribute
        ...
        this.driver = new HtmlUnitDriver(BrowserVersion.INTERNET_EXPLORER_8);
        this.driver.setJavascriptEnabled(true);
        this.driver.get(url);
        // the code does not execute after this point for the 3rd app
        ...
    }

I have also tried clicking the web element with the following code:

    public void applicationSelect() {
        ...
        WebElement element = this.driver.findElementByLinkText("linkText");
        element.click();
        // the code does not execute after this point for the 3rd app
        ...
    }

Clicking it produces exactly the same result. For the code above, I made sure I was getting the right element.

Can anyone tell me what problem I might be running into?

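As for the application, I cannot disclose any information about its HTML code. I know this makes the problem harder to solve, and I apologize in advance.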

=== Update 2013-04-10 ===

So, I added the library sources to my crawler and looked at where inside this.driver.get(url) it was getting stuck.

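Basically, the driver gets lost in an infinite refresh loop. Inside the WebClient object instantiated by HtmlUnitDriver, an HtmlPage is loaded that keeps refreshing itself, seemingly without end.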

Here is the code from WaitingRefreshHandler, which is in the com.gargoylesoftware.htmlunit package:

    public void handleRefresh(final Page page, final URL url, final int requestedWait) throws IOException {
        int seconds = requestedWait;
        if (seconds > maxwait_ && maxwait_ > 0) {
            seconds = maxwait_;
        }
        try {
            Thread.sleep(seconds * 1000);
        }
        catch (final InterruptedException e) {
            /* This can happen when the refresh is happening from a navigation that started
             * from a setTimeout or setInterval. The navigation will cause all threads to get
             * interrupted, including the current thread in this case. It should be safe to
             * ignore it since this is the thread now doing the navigation. Eventually we should
             * refactor to force all navigation to happen back on the main thread.
             */
            if (LOG.isDebugEnabled()) {
                LOG.debug("Waiting thread was interrupted. Ignoring interruption to continue navigation.");
            }
        }
        final WebWindow window = page.getEnclosingWindow();
        if (window == null) {
            return;
        }
        final WebClient client = window.getWebClient();
        client.getPage(window, new WebRequest(url));
    }

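The instruction "client.getPage(window, new WebRequest(url))" calls the WebClient again to reload the page, only to end up calling this very same refresh method once more. This appears to go on indefinitely, and the only reason it does not fill up memory quickly is the "Thread.sleep(seconds * 1000)", which forces a 3m wait before each new attempt.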
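To make the cycle easier to see, here is a minimal, self-contained sketch of it. It is only a toy model: FakeClient, effectiveWait, countLoads and the placeholder URL are hypothetical names invented for illustration, not HtmlUnit classes or methods. The only part taken from the quoted source is the maxwait clamp. A load cap is added so the demo terminates; the loop described above has no such cap.

```java
/** Toy model of the reload cycle. None of these types are HtmlUnit classes. */
class RefreshLoopSketch {

    /** Same clamp as in the quoted handleRefresh: maxwait only applies when positive. */
    static int effectiveWait(int requestedWait, int maxwait) {
        int seconds = requestedWait;
        if (seconds > maxwait && maxwait > 0) {
            seconds = maxwait;
        }
        return seconds;
    }

    /** Hypothetical stand-in for a client whose page always requests a refresh. */
    static class FakeClient {
        private final int loadCap; // safety cap so the demo ends; the real loop has none
        int loads = 0;

        FakeClient(int loadCap) {
            this.loadCap = loadCap;
        }

        void getPage(String url) {
            if (loads >= loadCap) {
                return;
            }
            loads++;
            handleRefresh(url); // the loaded page asks to be refreshed
        }

        void handleRefresh(String url) {
            // The real handler sleeps here first; the sleep is skipped in this sketch.
            getPage(url); // reloading triggers the next refresh request
        }
    }

    /** Loads the self-refreshing page once and reports how many loads resulted. */
    static int countLoads(int loadCap) {
        FakeClient client = new FakeClient(loadCap);
        client.getPage("http://example.invalid/app3"); // placeholder URL
        return client.loads;
    }

    public static void main(String[] args) {
        System.out.println(effectiveWait(180, 0));  // 180: a non-positive maxwait disables the clamp
        System.out.println(effectiveWait(180, 10)); // 10: clamped to maxwait
        System.out.println(countLoads(25));         // 25: the loop stops only at the cap
    }
}
```

Note that in this model the clamp shortens each wait but never ends the cycle; a single initial load keeps reloading until something outside the handler stops it.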

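Does anyone have any suggestions on how to work around this? One idea I have is to create 2 new classes that extend the original HtmlUnitDriver and WebClient, and then override the relevant methods to avoid the problem.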

Thanks again.

I solved my eternal refresh problem by creating a RefreshHandler class that does nothing:

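    import java.net.URL;
    import com.gargoylesoftware.htmlunit.Page;

    public class RefreshHandler implements com.gargoylesoftware.htmlunit.RefreshHandler {
        public RefreshHandler() { }

        // Intentionally empty: refresh requests are ignored.
        public void handleRefresh(final Page page, final URL url, final int seconds) { }
    }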

I also extended the HtmlUnitDriver class and set the new RefreshHandler by overriding the modifyWebClient method:

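    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;

    public class HtmlUnitDriverExt extends HtmlUnitDriver {
        public HtmlUnitDriverExt(BrowserVersion version) {
            super(version);
        }

        // Swap in the do-nothing handler before the driver starts using the client.
        @Override
        protected WebClient modifyWebClient(WebClient client) {
            client.setRefreshHandler(new RefreshHandler());
            return client;
        }
    }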

The modifyWebClient method is a no-op method that exists in HtmlUnitDriver for exactly this purpose.
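For readers unfamiliar with this kind of hook, here is a small self-contained sketch of the pattern. It is a toy model under stated assumptions: BaseDriver, CustomDriver, Client, modifyClient and the refreshPolicy strings are hypothetical stand-ins, not the Selenium or HtmlUnit classes, and it assumes the base class passes its client through the hook while constructing it.

```java
/** Toy model of an overridable no-op hook. These are NOT the Selenium/HtmlUnit classes. */
class HookSketch {

    /** Hypothetical client with one configurable setting. */
    static class Client {
        String refreshPolicy = "waiting"; // default behaviour in this model
    }

    /** Hypothetical base driver that exposes a customisation hook. */
    static class BaseDriver {
        private final Client client;

        BaseDriver() {
            // The base class builds its client, then passes it through the hook.
            this.client = modifyClient(new Client());
        }

        /** No-op by default: returns the client unchanged. */
        protected Client modifyClient(Client client) {
            return client;
        }

        String refreshPolicy() {
            return client.refreshPolicy;
        }
    }

    /** Subclass that overrides the hook to change the client's setting. */
    static class CustomDriver extends BaseDriver {
        @Override
        protected Client modifyClient(Client client) {
            client.refreshPolicy = "ignore";
            return client;
        }
    }

    public static void main(String[] args) {
        System.out.println(new BaseDriver().refreshPolicy());   // waiting
        System.out.println(new CustomDriver().refreshPolicy()); // ignore
    }
}
```

Because the hook in this sketch runs from the base constructor, the override should not rely on fields of the subclass, which are not yet initialised at that point. The override above only touches the client it is handed, so it is safe.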

Cheers.
