Tag: web crawler

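StormCrawler: performing an action after a domain finishes crawling: When the crawler finishes crawling a domain, I want to perform an action (in my case, emit a tuple to a bolt). I see that StormCrawler can revisit sites after a given interval. When several domains are crawled at the same time, which component knows, or how can I tell, when one domain has finished crawling? My current setup uses StormCrawler with Elasticsearch and Kibana.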

Nutch crawl does not work for a specific URL: I am using Apache Nutch for crawling. When I crawl the page http://www.google.co.in, it fetches the page correctly and produces results. But when I add a parameter to that URL, as in http://www.google.co.in/search?q=bill+gates, it fetches no results. solrUrl is not set, indexing will be skipped… crawl started in: crawl rootUrlDir = urls threads = 10 depth = 3 solrUrl=null topN = 100 Injector: starting at 2013-05-27 08:01:57 Injector: crawlDb: crawl/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: total number of urls […]

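Adding threads to a Java web crawler: This is the original web crawler I wrote (for reference): https://github.com/domshahbazi/java-webcrawler/tree/master It is a simple web crawler that visits a given initial web page, extracts all the links from the page and adds them to a queue (a LinkedList), then pops them off one at a time and visits each, at which point the loop starts again. To speed up my program, and to learn, I tried to implement it with threads so that I can run several threads at once and index more pages in less time. Here is each class: Main class public class controller { public static void main(String args[]) throws InterruptedException { DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12"); Thread crawl1 = new Crawler(data); Thread crawl2 = new Crawler(data); crawl1.start(); crawl2.start(); } } Crawler class (thread) public class Crawler extends Thread { /** Instance of Data Structure **/ DataStruc data; /** Number […]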
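The excerpt above shares one LinkedList queue between two threads, and java.util.LinkedList is not thread-safe. The following is a minimal sketch of one way to share crawl state safely. It is not the asker's code: the class name ConcurrentCrawlSketch and the linkExtractor function (which stands in for fetching a page and pulling out its links) are assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: a thread pool replaces hand-made Thread subclasses,
// and a concurrent set replaces the shared LinkedList.
class ConcurrentCrawlSketch {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final CountDownLatch done = new CountDownLatch(1);
    private final ExecutorService pool;
    // Assumption: maps a URL to the links found on that page.
    private final Function<String, List<String>> linkExtractor;

    ConcurrentCrawlSketch(int threads, Function<String, List<String>> linkExtractor) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.linkExtractor = linkExtractor;
    }

    /** Crawls everything reachable from the seed and returns the visited URLs. */
    Set<String> crawl(String seed) {
        submit(seed);
        try {
            done.await(); // released when the last in-flight task finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) {
            return; // add() is atomic, so each URL is scheduled at most once
        }
        inFlight.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : linkExtractor.apply(url)) {
                    submit(link); // children are counted before this task is uncounted
                }
            } finally {
                if (inFlight.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

In a real crawler the linkExtractor would download the page and parse its anchors. Passing it in as a function keeps the concurrency logic separate from the network code, so it can be exercised with an in-memory link graph.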

Increasing the number of threads in a crawler program: This is the code taken from http://code.google.com/p/crawler4j/ and the name of this file is MyCrawler.java public class MyCrawler extends WebCrawler { Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"); /* * You should implement this function to specify * whether the given URL should be visited or not. */ public boolean shouldVisit(WebURL url) […]

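Restricting URLs to the seed URL domains in crawler4j: I want crawler4j to visit pages only if they belong to one of the domains in the seeds. There are multiple domains in the seeds. How can I do that? Suppose I add these seed URLs: www.google.com www.yahoo.com www.wikipedia.com Now I start crawling, but I want my crawler to visit pages only within those three domains (as with shouldVisit()). There are obviously external links, but I want my crawler to stay within these domains. Subdomains and subfolders are fine, but nothing outside these domains.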
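One possible way to express the restriction described above is a small host check that a shouldVisit() implementation could delegate to, for example by passing it the string returned by url.getURL(). This is a sketch, not crawler4j's own mechanism: the class name SeedDomainFilter is hypothetical, and stripping a leading "www." so that subdomains of the seed match is an assumption about what "same domain" should mean here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: accepts a URL only if its host is a seed domain
// or a subdomain of one.
class SeedDomainFilter {
    private final Set<String> domains = new HashSet<>();

    SeedDomainFilter(Collection<String> seedHosts) {
        for (String host : seedHosts) {
            domains.add(normalize(host));
        }
    }

    // Assumption: "www.google.com" and "google.com" name the same domain.
    private static String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    boolean isAllowed(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are rejected
        }
        if (host == null) {
            return false; // relative or opaque URIs carry no host
        }
        String h = normalize(host);
        for (String d : domains) {
            // The "." prefix keeps "notgoogle.com" from matching "google.com".
            if (h.equals(d) || h.endsWith("." + d)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SeedDomainFilter filter = new SeedDomainFilter(
                Set.of("www.google.com", "www.yahoo.com", "www.wikipedia.com"));
        System.out.println(filter.isAllowed("https://mail.google.com/inbox"));
        System.out.println(filter.isAllowed("http://example.org/"));
    }
}
```

Comparing hosts rather than raw URL prefixes means subfolders are accepted automatically, while a URL that merely contains a seed domain somewhere in its path or query string is not.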

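JSP page import problem. Class files are placed in packages under WEB-INF/classes: I have a running web application, crawler_GUI, which has another Java project, jspider, on its build path. (I use Eclipse Galileo.) The GUI uses the jspider project as its back end. See http://sofzh.miximages.com/java/avmszn.jpg for the structure. The JSP creates an instance of a jspider object. At first I did not have the classes in the WEB-INF/classes folder; I corrected that mistake. Now it appears to work and shows no errors, but no task is executed. Here is the code: JSP <%//URL baseURL = new URL(Crawler.SelectedSites.get(0)); URL baseURL = new URL("http://www.buy.com"); System.out.println("******"); ESpider espider = new ESpider(baseURL); The asterisks are printed. ESpider.java public ESpider(URL baseURL) throws Exception { super(baseURL); System.out.println("test"); } It does not print "test". In fact, the parent constructor is not even called. No error is shown either. How can I solve this?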

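Java HTML parser for reading JavaScript-generated content: I use jsoup to read a web page with the following function. public Document getDocuement(String url){ Document doc = null; try { doc = Jsoup.connect(url).timeout(20*1000).userAgent("Mozilla").get(); } catch (Exception e) { return null; } return doc; } But whenever I try to read a page that contains JavaScript-generated content, jsoup does not read that content. That is, the actual content of the page is loaded by some JavaScript calls, so it is not present in the page source at that link. For example, this blog: http://blog.rapporter.net/search/label/r. Is there a way to get the JavaScript-generated content when parsing a page with jsoup? If not, please suggest a Java HTML parser that can handle this.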

HtmlUnit: an invalid or illegal selector was specified: I am trying to simulate a login with HtmlUnit. Although I wrote my code following the examples, I ran into an annoying problem. Here are some of the messages I get on the console. runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).] sourceName=[http://user.mofangge.com/Scripts/inc/jquery-1.10.2.js] line=[1640] lineSource=[null] lineOffset=[0] WARNING: Obsolete content type encountered: 'application/x-javascript'. CSS error: 'http://user.mofangge.com/Content/Css/Style1/Main.css' [1:1] Error in style sheet. (Invalid token "\u9518". Was expecting one of: , , , "", , , , , , , ".", ":", "*", […]

Syntax error, insert "... VariableDeclaratorId" to complete FormalParameterList: I am having some problems with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig; import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer; public class Controller { String crawlStorageFolder = "/data/crawl/root"; int numberOfCrawlers = 7; CrawlConfig config = new CrawlConfig(); config.setCrawlStorageFolder(crawlStorageFolder); /* * Instantiate the controller for this crawl. */ PageFetcher pageFetcher = new PageFetcher(config); RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher); CrawlController […]
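A likely cause of this error, judging only from the visible part of the excerpt, is that config.setCrawlStorageFolder(crawlStorageFolder); is a statement written directly in the class body, where Java permits only declarations. The sketch below shows the same shape with the statement moved into a method. To stay runnable without crawler4j on the classpath it uses java.util.Properties as a stand-in for CrawlConfig; the class name ControllerSketch and the method buildConfig are hypothetical.

```java
import java.util.Properties;

class ControllerSketch {
    // Declarations with initializers are legal at class level.
    static final String CRAWL_STORAGE_FOLDER = "/data/crawl/root";
    static final int NUMBER_OF_CRAWLERS = 7;

    // A bare call such as config.setProperty(...) placed at class level
    // would not compile; inside a method it is an ordinary statement.
    static Properties buildConfig() {
        Properties config = new Properties(); // stand-in for CrawlConfig (assumption)
        config.setProperty("crawlStorageFolder", CRAWL_STORAGE_FOLDER);
        config.setProperty("numberOfCrawlers", Integer.toString(NUMBER_OF_CRAWLERS));
        return config;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig().getProperty("crawlStorageFolder"));
    }
}
```

Applied to the original class, the same move would place the configuration calls and the controller setup inside main (or another method) and leave only field declarations in the class body.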

Logging in to LinkedIn with Jsoup: I need to log in to LinkedIn, preferably with Jsoup. This is what I use to log in to other websites, but it does not work for LinkedIn. Connection.Response res = Jsoup .connect("https://www.linkedin.com/uas/login?goback=&trk=hb_signin") .data("session_key", mail, "session_password", password) .method(Connection.Method.POST) .timeout(60000). // Also tried "https://www.linkedin.com/uas/login-submit" Map loginCookies = res.cookies(); //Checking a profile to see if it was successful or if it returns the login page. Document currentPage = Jsoup.connect(someProfileLink).cookies(loginCookies).timeout(10000). System.out.println("" + currentPage.text()); What am I doing wrong? I need to be able to fetch user profiles with a web crawler, but no matter what I try, I cannot get the login cookies.
