Tag: crawler4j

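Restricting crawler4j to the domains of the seed URLs: I want crawler4j to visit pages only if they belong to one of the domains in the seeds. There are multiple domains in the seeds. How can I do this? Suppose I add these seed URLs: www.google.com www.yahoo.com www.wikipedia.com Now I start crawling, but I want my crawler to visit pages only within the three domains above (as with shouldVisit()). There will obviously be external links, but I want the crawler restricted to these domains. Subdomains and subfolders are fine, but nothing outside these domains.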
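
The restriction described above comes down to a host check: a URL is acceptable if its host equals a seed domain or is a subdomain of one. Below is a minimal sketch of that check in plain Java, assuming the seeds are given as strings. The class name `SeedDomainFilter` and its methods are hypothetical and not part of crawler4j; in a crawler4j project the check would presumably be called from the `shouldVisit` override, whose exact signature varies between versions.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of crawler4j: remembers the seed domains
// and accepts only URLs whose host is one of them or a subdomain of one.
final class SeedDomainFilter {
    private final Set<String> seedDomains = new HashSet<>();

    // Registers the domain of a seed such as "www.google.com".
    void addSeed(String seedUrl) {
        String host = hostOf(seedUrl);
        if (host != null) {
            seedDomains.add(stripWww(host));
        }
    }

    // True if the URL's host equals a seed domain or ends with "." + domain.
    boolean isAllowed(String url) {
        String host = hostOf(url);
        if (host == null) {
            return false;
        }
        host = stripWww(host);
        for (String domain : seedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    // Extracts the lower-cased host; returns null for unparsable input.
    static String hostOf(String url) {
        try {
            String trimmed = url.trim();
            String withScheme = trimmed.contains("://") ? trimmed : "http://" + trimmed;
            String host = URI.create(withScheme).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        SeedDomainFilter filter = new SeedDomainFilter();
        filter.addSeed("www.google.com");
        filter.addSeed("www.yahoo.com");
        filter.addSeed("www.wikipedia.com");
        System.out.println(filter.isAllowed("http://mail.google.com/inbox"));
        System.out.println(filter.isAllowed("http://example.org/page"));
    }
}
```

Comparing hosts with `endsWith("." + domain)` rather than a plain substring test keeps look-alike hosts such as evilgoogle.com out while still admitting subdomains and subfolders.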

Syntax error, insert "... VariableDeclaratorId" to complete FormalParameterList: I am having some problems with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig; import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer; public class Controller { String crawlStorageFolder = "/data/crawl/root"; int numberOfCrawlers = 7; CrawlConfig config = new CrawlConfig(); config.setCrawlStorageFolder(crawlStorageFolder); /* * Instantiate the controller for this crawl. */ PageFetcher pageFetcher = new PageFetcher(config); RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher); CrawlController […]
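
In the visible part of the code above, `config.setCrawlStorageFolder(crawlStorageFolder);` sits directly in the class body, where Java permits only declarations. That is a likely trigger for this compiler message, although the excerpt is truncated and other causes cannot be ruled out. The sketch below shows the general shape of a fix: the statements move into a method. `Settings` is a hypothetical stand-in for crawler4j's `CrawlConfig`, used so that the example compiles without the library.

```java
// Sketch only: Settings is a hypothetical stand-in for CrawlConfig.
final class ControllerSketch {

    static final class Settings {
        private String storageFolder;

        void setStorageFolder(String folder) {
            this.storageFolder = folder;
        }

        String getStorageFolder() {
            return storageFolder;
        }
    }

    // Method calls are statements, so they must live inside a method
    // (or a constructor or initializer block), not in the class body.
    static Settings buildSettings() {
        String crawlStorageFolder = "/data/crawl/root";
        Settings config = new Settings();
        config.setStorageFolder(crawlStorageFolder);
        return config;
    }

    public static void main(String[] args) {
        System.out.println(buildSettings().getStorageFolder());
    }
}
```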

Calling the controller inside a loop (crawler4j-3.5): Hi, I am calling the controller in a for loop because I have more than 100 URLs. I keep them all in a list, iterate over it, and crawl each page. I also pass the URL via setCustomData, because the crawl should not leave the domain. for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) { String str = iterator.next(); System.out.println("cheking"+str); CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer); controller.setCustomData(str); controller.addSeed(str); controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers); controller.waitUntilFinish(); } But when I run the code above, the first URL is crawled completely, and then when the second URL starts an error is printed, as shown below. 50982 [main] INFO edu.uci.ics.crawler4j.crawler.CrawlController - Crawler 1 started. 51982 [Crawler 1] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager - Connection request: [route: {}->http://www.connectzone.in][total kept alive: 0; route allocated: 0 of 100; total […]

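Parsing robots.txt with Java and determining whether a URL is allowed: I currently use jsoup in my application to parse and analyze web pages. But I want to make sure I obey the robots.txt rules and visit only allowed pages. I am fairly sure jsoup is not made for this; it is all about web scraping and parsing. So I plan to write a function/module that reads the robots.txt of a domain/site and determines whether the URL I am about to visit is allowed. I did some research and found the following. I am not sure about these, though, so if anyone has done a similar project involving robots.txt parsing, it would be great if you could share your thoughts and ideas. http://sourceforge.net/projects/jrobotx/ https://code.google.com/p/crawler-commons/ http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12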
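
To make the planned function/module concrete, here is a minimal hand-rolled sketch in plain Java. It is not the API of any of the libraries linked above, and the class name `SimpleRobotsRules` is hypothetical. It assumes the robots.txt content has already been downloaded into a string, handles only `User-agent`, `Allow`, and `Disallow` with plain prefix matching (the longest matching prefix wins), and merges the `*` group with any group naming the crawler. Wildcards, `Crawl-delay`, and sitemaps are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical minimal robots.txt rule set: prefix matching only.
final class SimpleRobotsRules {
    private final List<String> allow = new ArrayList<>();
    private final List<String> disallow = new ArrayList<>();

    // Collects the Allow/Disallow prefixes from every group that applies
    // to userAgent (a group applies if it names "*" or the agent).
    static SimpleRobotsRules parse(String robotsTxt, String userAgent) {
        SimpleRobotsRules rules = new SimpleRobotsRules();
        String agent = userAgent.toLowerCase(Locale.ROOT);
        boolean groupApplies = false;
        boolean lastWasAgentLine = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');                 // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean matches = value.equals("*")
                        || (!value.isEmpty() && agent.contains(value.toLowerCase(Locale.ROOT)));
                // Consecutive User-agent lines share one group.
                groupApplies = lastWasAgentLine ? (groupApplies || matches) : matches;
                lastWasAgentLine = true;
            } else {
                lastWasAgentLine = false;
                if (!groupApplies || value.isEmpty()) {
                    continue;                            // empty Disallow permits everything
                }
                if (field.equals("disallow")) {
                    rules.disallow.add(value);
                } else if (field.equals("allow")) {
                    rules.allow.add(value);
                }
            }
        }
        return rules;
    }

    // The longest matching prefix decides; Allow wins ties; no match means allowed.
    boolean isAllowed(String path) {
        return longestPrefix(allow, path) >= longestPrefix(disallow, path);
    }

    private static int longestPrefix(List<String> prefixes, String path) {
        int best = -1;
        for (String prefix : prefixes) {
            if (path.startsWith(prefix) && prefix.length() > best) {
                best = prefix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public.html\n";
        SimpleRobotsRules rules = parse(robots, "mybot");
        System.out.println(rules.isAllowed("/index.html"));
        System.out.println(rules.isAllowed("/private/data.html"));
    }
}
```

A crawler would call `isAllowed` with the path portion of each URL before fetching it. For production use, a maintained parser is probably the safer choice, since real robots.txt files rely on features this sketch skips.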

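Web crawling with Java (pages using Ajax/JavaScript): I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I cannot crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741 . I want to crawl the following information from that site (please see the attached screenshot). As the attached screenshot shows, it has three names (highlighted in red boxes). If you click one of the links, you will see a popup window containing full information about that author. I want to crawl the information in that popup. I use the following code to crawl the content. public class WebContentDownloader { private Parser parser; private PageFetcher pageFetcher; public WebContentDownloader() { CrawlConfig config = new CrawlConfig(); parser = new Parser(config); pageFetcher = new PageFetcher(config); } private Page download(String url) { WebURL curURL = new WebURL(); curURL.setURL(url); PageFetchResult fetchResult = null; try { […]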