java.lang.Object org.apache.nutch.urlfilter.api.RegexURLFilterBase
public abstract class RegexURLFilterBase
Generic URL filter based on regular expressions.
The regular expression rules are expressed in a file. Each implementation
provides the rules through the getRulesReader(Configuration) method.
The file contains one rule per line, in the format:
[+-]<regex>
where plus (+) means the URL should be indexed and minus (-) means it should not.
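The rule format above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class RuleFormatSketch and its members are hypothetical names, and several behaviors are assumptions made for the example (blank lines and lines starting with # are skipped, the first matching rule decides, a URL that matches no rule is rejected, and a rule matches if the regex is found anywhere in the URL).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not part of the Nutch API.
class RuleFormatSketch {

    // One parsed rule: sign is true for '+', false for '-'.
    static final class Rule {
        final boolean sign;
        final Pattern pattern;

        Rule(boolean sign, String regex) {
            this.sign = sign;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Parses rules written as [+-]<regex>, one per line.
    // Assumption: blank lines and lines starting with '#' are skipped.
    static List<Rule> parse(String text) {
        List<Rule> rules = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char first = line.charAt(0);
            if (first != '+' && first != '-') {
                throw new IllegalArgumentException("Invalid rule: " + line);
            }
            rules.add(new Rule(first == '+', line.substring(1)));
        }
        return rules;
    }

    // Assumption: the first matching rule decides the outcome, and a URL
    // that matches no rule is rejected. Returns the URL if accepted, else null.
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.sign ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse("# skip images\n-\\.(gif|jpg|png)$\n+^https?://\n");
        System.out.println(filter(rules, "http://example.com/index.html"));
        System.out.println(filter(rules, "http://example.com/logo.png"));
        System.out.println(filter(rules, "ftp://example.com/file"));
    }
}
```

Under these assumptions, the main method prints the first URL unchanged and null for the other two: the image URL hits the minus rule first, and the ftp URL matches no rule at all.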
Field Summary |
---|
Fields inherited from interface org.apache.nutch.net.URLFilter |
---|
X_POINT_ID |
Constructor Summary | |
---|---|
|
RegexURLFilterBase()
Constructs a new, empty RegexURLFilterBase. |
|
RegexURLFilterBase(File filename)
Constructs a new RegexURLFilterBase and initializes it with a file of rules. |
protected |
RegexURLFilterBase(Reader reader)
Constructs a new RegexURLFilterBase and initializes it with a Reader of rules. |
|
RegexURLFilterBase(String rules)
Constructs a new RegexURLFilterBase and initializes it with a list of rules. |
Method Summary | |
---|---|
protected abstract RegexRule |
createRule(boolean sign,
String regex)
Creates a new RegexRule. |
String |
filter(String url)
|
Configuration |
getConf()
|
protected abstract Reader |
getRulesReader(Configuration conf)
Returns a reader of the rules to use for a particular implementation. |
static void |
main(RegexURLFilterBase filter,
String[] args)
Filter the standard input using a RegexURLFilterBase. |
void |
setConf(Configuration conf)
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public RegexURLFilterBase()
public RegexURLFilterBase(File filename) throws IOException, IllegalArgumentException
Parameters:
filename - the name of the rules file.
Throws:
IOException
IllegalArgumentException
public RegexURLFilterBase(String rules) throws IOException, IllegalArgumentException
Parameters:
rules - a string with a list of rules, one rule per line.
Throws:
IOException
IllegalArgumentException
protected RegexURLFilterBase(Reader reader) throws IOException, IllegalArgumentException
Parameters:
reader - a reader of rules.
Throws:
IOException
IllegalArgumentException
Method Detail |
---|
protected abstract RegexRule createRule(boolean sign, String regex)
Creates a new RegexRule.
Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
protected abstract Reader getRulesReader(Configuration conf) throws IOException
Parameters:
conf - the current configuration.
Throws:
IOException
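The two abstract methods above suggest a template-method design: the base class reads rule lines, while a subclass decides where the rules come from and how a (sign, regex) pair becomes a rule object. The sketch below illustrates that split with stand-in types only. SketchRule, SketchFilterBase, SketchJdkRegexFilter and loadRules are hypothetical names, the Configuration parameter is omitted, and the matching behavior (java.util.regex with find()) is an assumption rather than the Nutch implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for RegexRule; not the Nutch class.
abstract class SketchRule {
    private final boolean sign;

    SketchRule(boolean sign) {
        this.sign = sign;
    }

    // true: matching URLs are included; false: matching URLs are excluded.
    boolean accept() {
        return sign;
    }

    abstract boolean match(String url);
}

// Hypothetical stand-in for the base class, reduced to its two hooks.
abstract class SketchFilterBase {
    // Hook 1: where the rules come from (Configuration omitted in this sketch).
    protected abstract Reader getRulesReader() throws IOException;

    // Hook 2: how a sign and a regex become a rule object.
    protected abstract SketchRule createRule(boolean sign, String regex);

    // Reads [+-]<regex> lines and delegates rule construction to the subclass.
    List<SketchRule> loadRules() {
        List<SketchRule> rules = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(getRulesReader())) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // assumption: blank lines are ignored
                }
                char first = line.charAt(0);
                if (first != '+' && first != '-') {
                    throw new IllegalArgumentException("Invalid rule: " + line);
                }
                rules.add(createRule(first == '+', line.substring(1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rules;
    }
}

// A concrete subclass backed by java.util.regex and an in-memory rule list.
class SketchJdkRegexFilter extends SketchFilterBase {
    private final String rulesText;

    SketchJdkRegexFilter(String rulesText) {
        this.rulesText = rulesText;
    }

    @Override
    protected Reader getRulesReader() {
        return new StringReader(rulesText);
    }

    @Override
    protected SketchRule createRule(boolean sign, String regex) {
        final Pattern pattern = Pattern.compile(regex);
        return new SketchRule(sign) {
            @Override
            boolean match(String url) {
                return pattern.matcher(url).find();
            }
        };
    }
}
```

Keeping rule construction behind createRule means a subclass could, in principle, swap in a different regular expression engine without touching the code that reads the rules.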
public String filter(String url)
Specified by:
filter in interface URLFilter
public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
public Configuration getConf()
Specified by:
getConf in interface Configurable
public static void main(RegexURLFilterBase filter, String[] args) throws IOException, IllegalArgumentException
Filter the standard input using a RegexURLFilterBase.
Parameters:
filter - the RegexURLFilterBase to use for filtering the standard input.
args - some optional parameters (not used).
Throws:
IOException
IllegalArgumentException