Parsing robots.txt in Java and determining whether a URL is allowed

I am currently using jsoup in my application to parse and analyze web pages, but I want to make sure I follow the robots.txt rules and only visit pages that are allowed.

I'm fairly sure jsoup isn't made for this; it's about web page scraping and parsing. So I plan to write a function/module that reads the robots.txt of a domain/site and determines whether a URL I want to visit is allowed.

I did some research and found the following, but I'm not sure about these, so it would be great if anyone who has worked on a project involving robots.txt parsing could share their thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

This is a late answer, in case you – or someone else – are still looking for a way to do this. I'm using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example of the code I use:

    String USER_AGENT = "WhateverBot";
    String url = "http://www.....com/";
    URL urlObj = new URL(url);
    String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
            + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
    Map<String, BaseRobotRules> robotsTxtRules = new HashMap<>();
    BaseRobotRules rules = robotsTxtRules.get(hostId);
    if (rules == null) {
        // httpclient is an already initialized Apache HttpClient instance
        HttpGet httpget = new HttpGet(hostId + "/robots.txt");
        HttpContext context = new BasicHttpContext();
        HttpResponse response = httpclient.execute(httpget, context);
        if (response.getStatusLine() != null
                && response.getStatusLine().getStatusCode() == 404) {
            rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
            // consume entity to deallocate connection
            EntityUtils.consumeQuietly(response.getEntity());
        } else {
            BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
            SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
            rules = robotParser.parseContent(hostId,
                    IOUtils.toByteArray(entity.getContent()),
                    "text/plain", USER_AGENT);
        }
        robotsTxtRules.put(hostId, rules);
    }
    boolean urlAllowed = rules.isAllowed(url);

Obviously this doesn't have anything to do with Jsoup; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net stuff as well.
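As noted above, the Apache HttpClient part can be swapped for plain java.net. A minimal sketch of that swap (the class and method names here are mine, not from the answer; it assumes Java 9+ for `readAllBytes`):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RobotsFetcher {

    /** Builds the robots.txt URL for the host serving the given URL. */
    static String robotsTxtUrl(URL url) {
        return url.getProtocol() + "://" + url.getHost()
                + (url.getPort() > -1 ? ":" + url.getPort() : "")
                + "/robots.txt";
    }

    /** Fetches robots.txt with java.net; returns null on 404 (treat as allow-all). */
    static String fetchRobotsTxt(URL siteUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(robotsTxtUrl(siteUrl)).openConnection();
        conn.setRequestProperty("User-Agent", "WhateverBot");
        if (conn.getResponseCode() == 404) {
            return null;
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Only exercises the URL building; fetchRobotsTxt would hit the network.
        System.out.println(robotsTxtUrl(new URL("http://example.com:8080/some/page")));
    }
}
```

The bytes returned by such a fetch can then be handed to `SimpleRobotRulesParser.parseContent` exactly as in the snippet above.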

Please note that this code only checks allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But since crawler-commons provides this feature as well, it can easily be added to the code above.
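With crawler-commons, the delay is available from the parsed rules via `BaseRobotRules.getCrawlDelay()`. As a standalone illustration of the directive itself (this is my own minimal parse, not crawler-commons code, and it ignores which User-agent group the directive belongs to):

```java
public class CrawlDelay {

    /** Returns the first Crawl-delay value in seconds, or -1 if none is declared. */
    static long parseCrawlDelay(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim().toLowerCase();
            if (trimmed.startsWith("crawl-delay:")) {
                try {
                    return Long.parseLong(
                            trimmed.substring("crawl-delay:".length()).trim());
                } catch (NumberFormatException e) {
                    return -1; // malformed value: ignore the directive
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/";
        System.out.println(parseCrawlDelay(robotsTxt)); // prints 5
    }
}
```

A crawler would then sleep that many seconds between successive requests to the same host.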

The above did not work for me, so I managed to put this together. It's my first time doing Java in 4 years, so I'm sure this can be improved.

    // The original references this constant without defining it; the robots.txt
    // directive format implies its value.
    private static final String DISALLOW = "Disallow:";

    public static boolean robotSafe(URL url) {
        String strHost = url.getHost();
        String strRobot = "http://" + strHost + "/robots.txt";
        URL urlRobot;
        try {
            urlRobot = new URL(strRobot);
        } catch (MalformedURLException e) {
            // something weird is happening, so don't trust it
            return false;
        }

        String strCommands;
        try {
            InputStream urlRobotStream = urlRobot.openStream();
            byte b[] = new byte[1000];
            int numRead = urlRobotStream.read(b);
            strCommands = new String(b, 0, numRead);
            while (numRead != -1) {
                numRead = urlRobotStream.read(b);
                if (numRead != -1) {
                    String newCommands = new String(b, 0, numRead);
                    strCommands += newCommands;
                }
            }
            urlRobotStream.close();
        } catch (IOException e) {
            return true; // if there is no robots.txt file, it is OK to search
        }

        // if there are no "Disallow:" values, then the site is not blocking anything
        if (strCommands.contains(DISALLOW)) {
            String[] split = strCommands.split("\n");
            ArrayList<RobotRule> robotRules = new ArrayList<>();
            String mostRecentUserAgent = null;
            for (int i = 0; i < split.length; i++) {
                String line = split[i].trim();
                if (line.toLowerCase().startsWith("user-agent")) {
                    int start = line.indexOf(":") + 1;
                    int end = line.length();
                    mostRecentUserAgent = line.substring(start, end).trim();
                } else if (line.startsWith(DISALLOW)) {
                    if (mostRecentUserAgent != null) {
                        RobotRule r = new RobotRule();
                        r.userAgent = mostRecentUserAgent;
                        int start = line.indexOf(":") + 1;
                        int end = line.length();
                        r.rule = line.substring(start, end).trim();
                        robotRules.add(r);
                    }
                }
            }

            for (RobotRule robotRule : robotRules) {
                String path = url.getPath();
                if (robotRule.rule.length() == 0) return true; // allows everything if BLANK
                // note: this must be equals(), not ==, which compares references
                if (robotRule.rule.equals("/")) return false;  // allows nothing if /
                if (robotRule.rule.length() <= path.length()) {
                    String pathCompare = path.substring(0, robotRule.rule.length());
                    if (pathCompare.equals(robotRule.rule)) return false;
                }
            }
        }
        return true;
    }
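The core allow/deny decision in `robotSafe` is a prefix comparison of the URL path against each `Disallow` value. Factored out on its own (the helper name `isPathDisallowed` and the sample rules are mine, for illustration):

```java
import java.util.List;

public class PathCheck {

    /** True if the path falls under any non-empty Disallow rule; "/" blocks everything. */
    static boolean isPathDisallowed(String path, List<String> disallowRules) {
        for (String rule : disallowRules) {
            if (rule.isEmpty()) {
                continue; // a blank Disallow value blocks nothing
            }
            if (rule.equals("/")) {
                return true; // "/" blocks the whole site
            }
            if (path.startsWith(rule)) {
                return true; // path is under the disallowed prefix
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rules = List.of("/private/", "/tmp");
        System.out.println(isPathDisallowed("/private/index.html", rules)); // true
        System.out.println(isPathDisallowed("/public/index.html", rules));  // false
    }
}
```

Note that `startsWith` expresses the same check as the original's `substring` plus `equals`, and that like the answer's code this sketch ignores which user agent a rule group targets and does not handle `Allow` lines or wildcard patterns.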

You will also need the helper class:

    /**
     * @author Namhost.com
     */
    public class RobotRule {
        public String userAgent;
        public String rule;

        RobotRule() {
        }

        @Override
        public String toString() {
            StringBuilder result = new StringBuilder();
            String NEW_LINE = System.getProperty("line.separator");
            result.append(this.getClass().getName() + " Object {" + NEW_LINE);
            result.append("   userAgent: " + this.userAgent + NEW_LINE);
            result.append("   rule: " + this.rule + NEW_LINE);
            result.append("}");
            return result.toString();
        }
    }