Nutch crawling not working for a particular URL

I am using Apache Nutch for crawling. When I crawl the page http://www.google.co.in, it fetches the page correctly and produces results. But when I add a parameter to that URL, it fails to fetch any results for http://www.google.co.in/search?q=bill+gates.
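For reference, the crawl appears to have been launched with the legacy one-step crawl command of Nutch 1.x; the exact invocation below is only reconstructed from the parameters echoed in the log (rootUrlDir, threads, depth, topN), so treat it as an assumption:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 100

The run then produces the following output: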

    solrUrl is not set, indexing will be skipped...
    crawl started in: crawl
    rootUrlDir = urls
    threads = 10
    depth = 3
    solrUrl=null
    topN = 100
    Injector: starting at 2013-05-27 08:01:57
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: total number of urls rejected by filters: 0
    Injector: total number of urls injected after normalization and filtering: 1
    Injector: Merging injected urls into crawl db.
    Injector: finished at 2013-05-27 08:02:11, elapsed: 00:00:14
    Generator: starting at 2013-05-27 08:02:11
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: topN: 100
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: Partitioning selected urls for politeness.
    Generator: segment: crawl/segments/20130527080219
    Generator: finished at 2013-05-27 08:02:26, elapsed: 00:00:15
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    Fetcher: starting at 2013-05-27 08:02:26
    Fetcher: segment: crawl/segments/20130527080219
    Using queue mode : byHost
    Fetcher: threads: 10
    Fetcher: time-out divisor: 2
    QueueFeeder finished: total 1 records + hit by time limit :0
    Using queue mode : byHost
    Using queue mode : byHost
    Using queue mode : byHost
    fetching http://www.google.co.in/search?q=bill+gates
    Using queue mode : byHost
    Using queue mode : byHost
    Using queue mode : byHost
    Using queue mode : byHost
    Using queue mode : byHost
    Using queue mode : byHost
    Using queue mode : byHost
    Fetcher: throughput threshold: -1
    Fetcher: throughput threshold retries: 5
    -finishing thread FetcherThread, activeThreads=8
    -finishing thread FetcherThread, activeThreads=7
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=2
    -finishing thread FetcherThread, activeThreads=3
    -finishing thread FetcherThread, activeThreads=4
    -finishing thread FetcherThread, activeThreads=5
    -finishing thread FetcherThread, activeThreads=6
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=0
    -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=0
    Fetcher: finished at 2013-05-27 08:02:33, elapsed: 00:00:07
    ParseSegment: starting at 2013-05-27 08:02:33
    ParseSegment: segment: crawl/segments/20130527080219
    ParseSegment: finished at 2013-05-27 08:02:40, elapsed: 00:00:07
    CrawlDb update: starting at 2013-05-27 08:02:40
    CrawlDb update: db: crawl/crawldb
    CrawlDb update: segments: [crawl/segments/20130527080219]
    CrawlDb update: additions allowed: true
    CrawlDb update: URL normalizing: true
    CrawlDb update: URL filtering: true
    CrawlDb update: 404 purging: false
    CrawlDb update: Merging segment data into db.
    CrawlDb update: finished at 2013-05-27 08:02:54, elapsed: 00:00:13
    Generator: starting at 2013-05-27 08:02:54
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: topN: 100
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: 0 records selected for fetching, exiting ...
    Stopping at depth=1 - no more URLs to fetch.
    LinkDb: starting at 2013-05-27 08:03:01
    LinkDb: linkdb: crawl/linkdb
    LinkDb: URL normalize: true
    LinkDb: URL filter: true
    LinkDb: internal links will be ignored.
    LinkDb: adding segment: file:/home/muthu/workspace/webcrawler/crawl/segments/20130527080219
    LinkDb: finished at 2013-05-27 08:03:08, elapsed: 00:00:07
    crawl finished: crawl
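As an aside, note the fetcher warning about http.agent.name. Both http.agent.name and http.robots.agents are standard Nutch properties set in conf/nutch-site.xml; the agent name MyCrawler below is a placeholder, and this is only a minimal sketch of how that warning is usually silenced:

    <!-- conf/nutch-site.xml: crawler identification (MyCrawler is a placeholder name) -->
    <property>
      <name>http.agent.name</name>
      <value>MyCrawler</value>
    </property>
    <property>
      <!-- the agent name must be listed first, as the warning says -->
      <name>http.robots.agents</name>
      <value>MyCrawler,*</value>
    </property>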

I have already added this code:

    # skip URLs containing certain characters as probable queries, etc.
    -.*[?*!@=].*
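With this rule active in conf/regex-urlfilter.txt, any URL containing ?, *, !, @ or = would be rejected at injection time; since the log above shows the query URL being injected and fetched, the rule was evidently not the blocker here. For completeness, a relaxed variant that lets query strings through might look like this (my own sketch, not from the original post):

    # allow '?' and '=' so query URLs pass the filter; still skip *, !, @
    -.*[*!@].*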

Why does this happen? How can I fetch URLs that have parameters? Thanks in advance for your help.

The Nutch crawler obeys robots.txt. If you look at the robots.txt located at http://www.google.co.in/robots.txt, you will see that /search is disallowed for crawling.
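An abridged excerpt of what that file contains (the live file changes over time, so verify against the current version):

    User-agent: *
    Disallow: /search
    # (many more rules follow)

Because the fetcher checks these rules before downloading, the page body is never retrieved, no outlinks get parsed, and the next Generator round selects 0 records, which is exactly what your log shows.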