org.apache.nutch.indexer.anchor
Class AnchorIndexingFilter
java.lang.Object
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
- All Implemented Interfaces:
- Configurable, IndexingFilter, FieldPluggable, Pluggable
public class AnchorIndexingFilter
- extends Object
- implements IndexingFilter
Indexing filter that offers an option to either index all inbound anchor text for
a document or deduplicate anchors. Deduplication does have its cons.
- See Also:
anchorIndexingFilter.deduplicate in nutch-default.xml.
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
AnchorIndexingFilter
public AnchorIndexingFilter()
setConf
public void setConf(Configuration conf)
- Set the Configuration object.
- Specified by:
setConf in interface Configurable
getConf
public Configuration getConf()
- Get the Configuration object.
- Specified by:
getConf in interface Configurable
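The setConf/getConf pair is where a filter like this would typically pick up the anchorIndexingFilter.deduplicate flag. The following is a minimal sketch of that pattern, not the Nutch source: java.util.Properties stands in for the Hadoop Configuration object so the example runs on its own, and the class name ConfigurableSketch, the isDeduplicate accessor, and the default of false are assumptions made for illustration.

```java
import java.util.Properties;

// Illustrative stand-in, not the Nutch implementation.
// Properties replaces Hadoop's Configuration so no extra jars are needed.
class ConfigurableSketch {
    private Properties conf;
    private boolean deduplicate;

    // Store the configuration and cache the flag once, so filter()
    // does not have to parse it for every document.
    void setConf(Properties conf) {
        this.conf = conf;
        // Assumption: the flag is treated as false when it is not set.
        this.deduplicate = Boolean.parseBoolean(
                conf.getProperty("anchorIndexingFilter.deduplicate", "false"));
    }

    Properties getConf() {
        return conf;
    }

    // Hypothetical accessor, added only so the cached value can be inspected.
    boolean isDeduplicate() {
        return deduplicate;
    }
}
```

With a real Hadoop Configuration the equivalent lookup would be conf.getBoolean("anchorIndexingFilter.deduplicate", false).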
addIndexBackendOptions
public void addIndexBackendOptions(Configuration conf)
filter
public NutchDocument filter(NutchDocument doc,
String url,
WebPage page)
throws IndexingException
- Adds the inbound anchor text for the page to the document. The
AnchorIndexingFilter supports a boolean configuration setting for the
deduplication of anchors. See anchorIndexingFilter.deduplicate in
nutch-default.xml.
- Specified by:
filter in interface IndexingFilter
- Parameters:
doc - The NutchDocument object
url - URL to be filtered for anchor text
page - WebPage object relative to the URL
- Returns:
- filtered NutchDocument
- Throws:
IndexingException
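To make the two modes concrete, here is a rough sketch of how anchor collection with optional deduplication could work. It is not the Nutch source: plain collections stand in for NutchDocument and WebPage, the class and method names are hypothetical, and the case-insensitive comparison is an assumption about what counts as a duplicate.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in, not the Nutch implementation.
class AnchorDedupSketch {

    // inlinks maps a linking URL to the anchor text it uses (hypothetical shape).
    // Returns the anchor values that would be added to the document.
    static List<String> collectAnchors(Map<String, String> inlinks, boolean deduplicate) {
        List<String> anchors = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inlinks.values()) {
            if (deduplicate) {
                // Assumption: anchors are compared case-insensitively.
                String key = anchor.toLowerCase(Locale.ROOT);
                if (!seen.add(key)) {
                    continue; // an equivalent anchor was already kept
                }
            }
            anchors.add(anchor);
        }
        return anchors;
    }

    public static void main(String[] args) {
        Map<String, String> inlinks = new LinkedHashMap<>();
        inlinks.put("http://a.example/", "Apache Nutch");
        inlinks.put("http://b.example/", "apache nutch");
        inlinks.put("http://c.example/", "Crawler");

        System.out.println(collectAnchors(inlinks, false)); // keeps all three anchors
        System.out.println(collectAnchors(inlinks, true));  // keeps one anchor per distinct text
    }
}
```

The sketch also hints at the downside of deduplication: once repeated anchors are collapsed, the index no longer reflects how many pages used the same anchor text.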
getFields
public Collection<WebPage.Field> getFields()
- Gets all the fields for a given WebPage.
Many datastores need to set up the MapReduce job by specifying the fields
needed. All extensions that work on WebPage are able to specify what fields
they need.
- Specified by:
getFields in interface FieldPluggable
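The idea behind getFields can be sketched as follows: each extension declares the fields it reads, and the job setup takes the union so the datastore loads only what is needed. This is an illustration under stated assumptions, not Nutch code: the Field enum is a hypothetical stand-in for WebPage.Field, the second plugin is invented for the example, and the claim that an anchor filter needs only inbound links is an assumption.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;

// Illustrative stand-in, not the Nutch implementation.
class FieldPluggableSketch {

    // Hypothetical stand-in for WebPage.Field; constants are illustrative.
    enum Field { CONTENT, INLINKS, TITLE }

    // Assumption: an anchor filter only needs the page's inbound links.
    static Collection<Field> anchorFilterFields() {
        return EnumSet.of(Field.INLINKS);
    }

    // Hypothetical second plugin, present only to show the union.
    static Collection<Field> titleFilterFields() {
        return EnumSet.of(Field.TITLE);
    }

    // Job setup: merge the fields requested by every plugin.
    static EnumSet<Field> fieldsToLoad(List<Collection<Field>> requests) {
        EnumSet<Field> all = EnumSet.noneOf(Field.class);
        for (Collection<Field> request : requests) {
            all.addAll(request);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(fieldsToLoad(
                Arrays.asList(anchorFilterFields(), titleFilterFields())));
    }
}
```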
Copyright © 2012 The Apache Software Foundation