org.apache.nutch.protocol.http.api
Class RobotRulesParser
java.lang.Object
  org.apache.nutch.protocol.http.api.RobotRulesParser
- All Implemented Interfaces:
- Configurable
public class RobotRulesParser
- extends Object
- implements Configurable
This class handles the parsing of robots.txt files.
It emits RobotRules objects, which describe the download permissions
that a site's robots.txt file grants to a crawler.
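To illustrate the kind of check a parsed rule set supports, here is a minimal, self-contained sketch of robots.txt path matching: ordered allow/disallow prefix rules tested against a URL path. This is a simplified stand-in, not Nutch's implementation; the class and method names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of robots.txt permission checking: ordered
// (prefix, allowed) rules tested against a path. Illustrative only;
// Nutch's RobotRuleSet holds the rules parsed for the configured agent.
public class SimpleRobotRules {
    private final List<String> prefixes = new ArrayList<>();
    private final List<Boolean> allowed = new ArrayList<>();

    public void addRule(String prefix, boolean allow) {
        prefixes.add(prefix);
        allowed.add(allow);
    }

    // First matching prefix wins; a path matching no rule is allowed.
    public boolean isAllowed(String path) {
        for (int i = 0; i < prefixes.size(); i++) {
            if (path.startsWith(prefixes.get(i))) {
                return allowed.get(i);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleRobotRules rules = new SimpleRobotRules();
        rules.addRule("/private/", false); // Disallow: /private/
        rules.addRule("/", true);          // everything else allowed
        System.out.println(rules.isAllowed("/private/data.html")); // false
        System.out.println(rules.isAllowed("/index.html"));        // true
    }
}
```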
- Author:
- Tom Pierce, Mike Cafarella, Doug Cutting
Nested Class Summary

static class  RobotRulesParser.RobotRuleSet
              This class holds the rules which were parsed from a robots.txt
              file, and can test paths against those rules.
Field Summary

static org.slf4j.Logger  LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
RobotRulesParser
public RobotRulesParser(Configuration conf)
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
getConf
public Configuration getConf()
- Specified by:
getConf
in interface Configurable
getRobotRulesSet
public RobotRulesParser.RobotRuleSet getRobotRulesSet(HttpBase http,
                                                      String url)
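A common pattern behind a method like getRobotRulesSet is to cache the parsed rule set per host, so a site's robots.txt is downloaded at most once. The sketch below shows only that caching pattern; the "host:port" cache key and the stubbed fetcher (a plain Function standing in for HttpBase) are assumptions for illustration, not Nutch's exact internals.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: per-host caching of robots rules. The fetcher is a stub that
// maps a "host:port" key to raw rules text; in Nutch the download and
// parsing are done via HttpBase.
public class RobotsCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher; // stand-in for HttpBase

    public RobotsCacheSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    public String getRules(URL url) {
        int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
        String key = url.getHost() + ":" + port;
        // computeIfAbsent: fetch and parse only on the first request per host.
        return cache.computeIfAbsent(key, fetcher);
    }

    public static void main(String[] args) throws Exception {
        RobotsCacheSketch cache =
            new RobotsCacheSketch(key -> "Disallow: /private/");
        System.out.println(cache.getRules(new URL("http://example.com/a")));
    }
}
```

Requests for `http://example.com/a` and `http://example.com:80/b` share one cache entry, since both normalize to the key `example.com:80`.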
isAllowed
public boolean isAllowed(HttpBase http,
                         URL url)
                  throws ProtocolException,
                         IOException
- Throws:
ProtocolException
IOException
getCrawlDelay
public long getCrawlDelay(HttpBase http,
                          URL url)
                   throws ProtocolException,
                          IOException
- Throws:
ProtocolException
IOException
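Crawl-delay is a de-facto robots.txt extension (not part of the original standard) whose value is given in seconds. The sketch below shows one way such a directive could be extracted from robots.txt text; it is a simplified assumption for illustration (it ignores User-agent scoping, and converting to milliseconds with -1 meaning "not set" is a choice made here, not necessarily Nutch's).

```java
// Sketch: extracting a Crawl-delay directive from robots.txt text.
// Simplified: ignores User-agent grouping; a real parser applies the
// delay only from the block matching the crawler's agent name.
public class CrawlDelaySketch {
    public static long crawlDelayMillis(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            String t = line.trim().toLowerCase();
            if (t.startsWith("crawl-delay:")) {
                try {
                    // Directive value is in seconds; convert to milliseconds.
                    return (long) (Double.parseDouble(t.substring(12).trim()) * 1000);
                } catch (NumberFormatException e) {
                    return -1; // unparseable value: treat as unset
                }
            }
        }
        return -1; // no directive found
    }

    public static void main(String[] args) {
        System.out.println(crawlDelayMillis("User-agent: *\nCrawl-delay: 5\n")); // 5000
    }
}
```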
main
public static void main(String[] argv)
- Command-line entry point for testing the parser.
Copyright © 2012 The Apache Software Foundation