How do I create a web crawler in Java?

Hi, I want to create a web crawler in Java. I want to retrieve some data from web pages, such as the title and description, and store that data in a database.

If you want to do it yourself, you can use HttpClient, which is also included in the Android API.

Example usage of HttpClient (you only need to add your own parsing of the pages):

    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Set;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;

    public class HttpTest {

        public static void main(String... args) throws ClientProtocolException, IOException {
            crawlPage("http://www.google.com/");
        }

        /* URLs that have already been visited, so no page is crawled twice */
        static Set<String> checked = new HashSet<String>();

        private static void crawlPage(String url) throws ClientProtocolException, IOException {
            if (checked.contains(url))
                return;
            checked.add(url);
            System.out.println("Crawling: " + url);

            HttpClient client = new DefaultHttpClient();
            HttpGet request = new HttpGet(url); // fetch the given url, not a hard-coded one
            HttpResponse response = client.execute(request);
            Reader reader = null;
            try {
                reader = new InputStreamReader(response.getEntity().getContent());
                /* collect all links on the page and crawl them recursively */
                Links links = new Links();
                new ParserDelegator().parse(reader, links, true);
                for (String link : links.list)
                    if (link.startsWith("http://"))
                        crawlPage(link);
            } finally {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        /* parser callback that records the href of every <a> tag */
        static class Links extends HTMLEditorKit.ParserCallback {

            List<String> list = new LinkedList<String>();

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A && a.getAttribute(HTML.Attribute.HREF) != null)
                    list.add(a.getAttribute(HTML.Attribute.HREF).toString());
            }
        }
    }
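The example above only follows links; to get the title and description the question asks about and put them in a database, you still need a parsing step and a storage step. Below is a minimal sketch of that part, assuming the jsoup library for parsing and plain JDBC against a table named pages (the table, its columns, and the H2 connection URL are assumptions for illustration, not part of the answer above):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PageSaver {

        public static void main(String[] args) throws Exception {
            String url = "http://www.example.com/";

            /* fetch and parse the page; jsoup does the HTTP request and the HTML parsing */
            Document doc = Jsoup.connect(url).get();
            String title = doc.title();
            /* the meta description is optional; attr() returns "" if it is missing */
            String description = doc.select("meta[name=description]").attr("content");

            /* store the result; the "pages" table is assumed to exist already */
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./crawler", "sa", "");
                 PreparedStatement stmt = conn.prepareStatement(
                         "INSERT INTO pages (url, title, description) VALUES (?, ?, ?)")) {
                stmt.setString(1, url);
                stmt.setString(2, title);
                stmt.setString(3, description);
                stmt.executeUpdate();
            }
        }
    }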

You can use crawler4j. Crawler4j is an open-source Java crawler which provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in a few hours.
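For orientation, here is a rough sketch of what a crawler4j-based crawler that prints the title of each visited page could look like. It is written against the crawler4j 4.x API (the shouldVisit signature differs in older versions), and MyCrawler, the example.com seed, and the storage folder are placeholders:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        /* only follow links that stay on the seed site */
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().toLowerCase().startsWith("http://www.example.com/");
        }

        /* called for every downloaded page */
        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                System.out.println("URL:   " + page.getWebURL().getURL());
                System.out.println("Title: " + html.getTitle());
                /* html.getText() / html.getHtml() is what you would store in the database */
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl"); // intermediate crawl data goes here
            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
            CrawlController controller = new CrawlController(config, fetcher, robots);
            controller.addSeed("http://www.example.com/");
            controller.start(MyCrawler.class, 8); // 8 crawler threads
        }
    }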

Have a look at this example: http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

And at the existing open-source crawlers: http://java-source.net/open-source/crawlers/java-web-crawler

You can use WebCollector: https://github.com/CrawlScript/WebCollector

A demo based on WebCollector 2.05:

    import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
    import cn.edu.hfut.dmic.webcollector.model.Links;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import java.util.regex.Pattern;
    import org.jsoup.nodes.Document;

    /**
     * Crawl news from Yahoo News
     *
     * @author hu
     */
    public class YahooCrawler extends BreadthCrawler {

        /**
         * @param crawlPath crawlPath is the path of the directory which maintains
         *                  information of this crawler
         * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
         *                  links which match the regex rules from the page
         */
        public YahooCrawler(String crawlPath, boolean autoParse) {
            super(crawlPath, autoParse);
            /* start page */
            this.addSeed("http://news.yahoo.com/");
            /* fetch urls like http://news.yahoo.com/xxxxx */
            this.addRegex("http://news.yahoo.com/.*");
            /* do not fetch urls like http://news.yahoo.com/xxxx/xxx */
            this.addRegex("-http://news.yahoo.com/.+/.*");
            /* do not fetch jpg|png|gif */
            this.addRegex("-.*\\.(jpg|png|gif).*");
            /* do not fetch urls containing # */
            this.addRegex("-.*#.*");
        }

        @Override
        public void visit(Page page, Links nextLinks) {
            String url = page.getUrl();
            /* if page is a news page */
            if (Pattern.matches("http://news.yahoo.com/.+html", url)) {
                /* we use jsoup to parse the page */
                Document doc = page.getDoc();
                /* extract title and content of the news item by CSS selector */
                String title = doc.select("h1[class=headline]").first().text();
                String content = doc.select("div[class=body yom-art-content clearfix]").first().text();
                System.out.println("URL:\n" + url);
                System.out.println("title:\n" + title);
                System.out.println("content:\n" + content);
                /* If you want to add urls to crawl, add them to nextLinks */
                /* WebCollector automatically filters links that have been fetched before */
                /* If autoParse is true and a link you add to nextLinks does not match
                   the regex rules, that link will also be filtered. */
                // nextLinks.add("http://xxxxxx.com");
            }
        }

        public static void main(String[] args) throws Exception {
            YahooCrawler crawler = new YahooCrawler("crawl", true);
            crawler.setThreads(50);
            crawler.setTopN(100);
            // crawler.setResumable(true);
            /* start crawl with a depth of 4 */
            crawler.start(4);
        }
    }