AnchorIndexingFilter (apache-nutch 2.1 API)

org.apache.nutch.indexer.anchor
Class AnchorIndexingFilter

java.lang.Object
  org.apache.nutch.indexer.anchor.AnchorIndexingFilter

All Implemented Interfaces: Configurable, IndexingFilter, FieldPluggable, Pluggable

public class AnchorIndexingFilter
extends Object
implements IndexingFilter

Indexing filter that offers an option to either index all inbound anchor text for a document or deduplicate anchors. Deduplication does have its cons.

See Also: anchorIndexingFilter.deduplicate in nutch-default.xml.

Field Summary
`static org.slf4j.Logger`	`LOG`

Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
`X_POINT_ID`

Constructor Summary
`AnchorIndexingFilter()`

Method Summary
`void`	`addIndexBackendOptions(Configuration conf)`
`NutchDocument`	`filter(NutchDocument doc, String url, WebPage page)` Filters the document for anchor text; a boolean configuration setting controls whether anchors are deduplicated.
`Configuration`	`getConf()` Gets the `Configuration` object.
`Collection<WebPage.Field>`	`getFields()` Gets all the fields needed for a given `WebPage`. Many datastores need to set up the MapReduce job by specifying the fields needed.
`void`	`setConf(Configuration conf)` Sets the `Configuration` object.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

AnchorIndexingFilter

public AnchorIndexingFilter()

Method Detail

setConf

public void setConf(Configuration conf)

Sets the Configuration object.

Specified by: setConf in interface Configurable

getConf

public Configuration getConf()

Gets the Configuration object.

Specified by: getConf in interface Configurable

addIndexBackendOptions

public void addIndexBackendOptions(Configuration conf)

filter

public NutchDocument filter(NutchDocument doc,
                            String url,
                            WebPage page)
                     throws IndexingException

Filters the document for anchor text. A boolean configuration setting controls whether anchors are deduplicated; see anchorIndexingFilter.deduplicate in nutch-default.xml.

Specified by: filter in interface IndexingFilter

Parameters: doc - the NutchDocument object; url - URL to be filtered for anchor text; page - WebPage object corresponding to the URL
Returns: the filtered NutchDocument
Throws: IndexingException
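
The choice between indexing every inbound anchor and deduplicating them can be illustrated with a small standalone sketch. This is not the Nutch implementation: the class `AnchorDedupSketch`, the method `selectAnchors`, and the case-insensitive comparison are all assumptions made for illustration, and the sketch works on plain strings rather than on `NutchDocument` or `WebPage`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of the Nutch API.
class AnchorDedupSketch {

    /**
     * Returns the anchor texts that would be indexed.
     * When deduplicate is false, every inbound anchor is kept.
     * When deduplicate is true, only the first occurrence of each anchor
     * is kept, comparing case-insensitively (an assumption of this sketch).
     */
    static List<String> selectAnchors(List<String> inboundAnchors, boolean deduplicate) {
        List<String> selected = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inboundAnchors) {
            if (deduplicate) {
                String key = anchor.toLowerCase(Locale.ROOT);
                // Set.add returns false when the key was already present.
                if (!seen.add(key)) {
                    continue;
                }
            }
            selected.add(anchor);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Apache Nutch", "apache nutch", "Nutch home");
        System.out.println(selectAnchors(anchors, false)); // [Apache Nutch, apache nutch, Nutch home]
        System.out.println(selectAnchors(anchors, true));  // [Apache Nutch, Nutch home]
    }
}
```

One cost of deduplication is visible even in this sketch: repeated anchors collapse to a single entry, so any signal carried by how often an anchor text occurs is lost.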

getFields

public Collection<WebPage.Field> getFields()

Gets all the fields needed for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.

Specified by: getFields in interface FieldPluggable
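
The idea that each extension declares the fields it needs, so that a job can request only those fields from the datastore, can be sketched as follows. Everything here is a hypothetical stand-in: the `Field` enum substitutes for `WebPage.Field`, `FieldUser` substitutes for `FieldPluggable`, and `fieldsToLoad` is an invented helper, not a Nutch method. The field names are illustrative as well.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the Nutch API.
class FieldUnionSketch {

    // Stand-in for WebPage.Field; the constants are illustrative.
    enum Field { INLINKS, TITLE, TEXT, CONTENT }

    // Stand-in for FieldPluggable: an extension reports the fields it reads.
    interface FieldUser {
        Collection<Field> getFields();
    }

    /** Computes the union of the fields requested by all extensions. */
    static Set<Field> fieldsToLoad(List<FieldUser> extensions) {
        Set<Field> needed = EnumSet.noneOf(Field.class);
        for (FieldUser extension : extensions) {
            needed.addAll(extension.getFields());
        }
        return needed;
    }

    public static void main(String[] args) {
        FieldUser anchorFilter = () -> EnumSet.of(Field.INLINKS);
        FieldUser textFilter = () -> EnumSet.of(Field.TITLE, Field.TEXT);
        // Only the union is requested; CONTENT is never loaded here.
        System.out.println(fieldsToLoad(Arrays.asList(anchorFilter, textFilter))); // [INLINKS, TITLE, TEXT]
    }
}
```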