- CACHE - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
- CACHING_FORBIDDEN_ALL - Static variable in interface org.apache.nutch.metadata.Nutch
-
Don't show either original forbidden content or summaries.
- CACHING_FORBIDDEN_CONTENT - Static variable in interface org.apache.nutch.metadata.Nutch
-
Don't show original forbidden content, but show summaries.
- CACHING_FORBIDDEN_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Sites may request that search engines don't provide access to cached documents.
- CACHING_FORBIDDEN_NONE - Static variable in interface org.apache.nutch.metadata.Nutch
-
Show both original forbidden content and summaries (default).
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.MD5Signature
-
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.Signature
-
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.TextProfileSignature
-
- calculateLastFetchTime(CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Returns the last fetch time of the CrawlDatum.
- calculateLastFetchTime(CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Calculates last fetch time of the given CrawlDatum.
- CCIndexingFilter - Class in org.creativecommons.nutch
-
Adds basic searchable fields to a document.
- CCIndexingFilter() - Constructor for class org.creativecommons.nutch.CCIndexingFilter
-
- CCParseFilter - Class in org.creativecommons.nutch
-
Adds metadata identifying the Creative Commons license used, if any.
- CCParseFilter() - Constructor for class org.creativecommons.nutch.CCParseFilter
-
- CCParseFilter.Walker - Class in org.creativecommons.nutch
-
Walks DOM tree, looking for RDF in comments and licenses in anchors.
- cdata(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of cdata.
- CHAR_ENCODING_FOR_CONVERSION - Static variable in interface org.apache.nutch.metadata.Nutch
-
- characters(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of character data.
- charactersRaw(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
If available, when the disable-output-escaping attribute is used,
output raw text without escaping.
- CHARSET_UTF8 - Static variable in class org.apache.nutch.parse.feed.FeedParser
-
- CHECK_BLOCKING - Static variable in interface org.apache.nutch.protocol.Protocol
-
Property name.
- CHECK_ROBOTS - Static variable in interface org.apache.nutch.protocol.Protocol
-
Property name.
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- checkOutputSpecs(FileSystem, JobConf) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
-
- checkOutputSpecs(FileSystem, JobConf) - Method in class org.apache.nutch.parse.ParseOutputFormat
-
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- childLen - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
-
- children - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
-
- childrenList - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
-
- chooseRepr(String, String, boolean) - Static method in class org.apache.nutch.util.URLUtil
-
Given two URLs, the source and the destination of a redirect, returns the
representative URL.
- CircularDependencyException - Exception in org.apache.nutch.plugin
-
CircularDependencyException will be thrown if a circular dependency is detected.
- CircularDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
-
- CircularDependencyException(String) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
-
- cleanField(String) - Static method in class org.apache.nutch.util.StringUtil
-
Simple character substitution which cleans all � chars from a given String.
- CleaningJob - Class in org.apache.nutch.indexer
-
The class scans CrawlDB looking for entries with status DB_GONE (404) and
sends delete requests to indexers for those documents.
- CleaningJob() - Constructor for class org.apache.nutch.indexer.CleaningJob
-
- CleaningJob.DBFilter - Class in org.apache.nutch.indexer
-
- CleaningJob.DBFilter() - Constructor for class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- CleaningJob.DeleterReducer - Class in org.apache.nutch.indexer
-
- CleaningJob.DeleterReducer() - Constructor for class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- cleanMimeType(String) - Static method in class org.apache.nutch.util.MimeUtil
-
Cleans a MimeType name by extracting the actual MimeType from a string of the form:
- clear() - Method in class org.apache.nutch.crawl.Inlinks
-
- clear() - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- clear() - Method in class org.apache.nutch.metadata.Metadata
-
Remove all mappings from metadata.
- clearClues() - Method in class org.apache.nutch.util.EncodingDetector
-
Clears all clues.
- Client - Class in org.apache.nutch.protocol.ftp
-
Client.java encapsulates the functionality Nutch needs to
get directory listings and retrieve files from an FTP server.
- Client() - Constructor for class org.apache.nutch.protocol.ftp.Client
-
- clone() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- clone() - Method in class org.apache.nutch.indexer.NutchField
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbFilter
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- close(Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReducer
-
- close() - Method in class org.apache.nutch.crawl.Generator.Selector
-
- close() - Method in class org.apache.nutch.crawl.Injector.InjectMapper
-
- close() - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
- close() - Method in class org.apache.nutch.crawl.LinkDb
-
- close() - Method in class org.apache.nutch.crawl.LinkDbFilter
-
- close() - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- close() - Method in class org.apache.nutch.crawl.LinkDbReader
-
- close() - Method in class org.apache.nutch.crawl.URLPartitioner
-
- close() - Method in class org.apache.nutch.fetcher.Fetcher
-
- close() - Method in class org.apache.nutch.fetcher.OldFetcher
-
- close() - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- close() - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- close() - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- close() - Method in interface org.apache.nutch.indexer.IndexWriter
-
- close() - Method in class org.apache.nutch.indexer.IndexWriters
-
- close() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- close() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- close() - Method in class org.apache.nutch.parse.ParseSegment
-
- close() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
- close() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
- close() - Method in class org.apache.nutch.scoring.webgraph.LinkRank
-
- close() - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
- close() - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
- close() - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
- close() - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
- close() - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
- close() - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- close() - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
- close() - Method in class org.apache.nutch.segment.SegmentMerger
-
- close() - Method in class org.apache.nutch.segment.SegmentReader
-
- close() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Closes the record reader resources.
- close() - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- close() - Method in class org.apache.nutch.tools.CrawlDBScanner
-
- closeReaders(SequenceFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
-
Closes a group of SequenceFile readers.
- closeReaders(MapFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
-
Closes a group of MapFile readers.
- CollectionManager - Class in org.apache.nutch.collection
-
- CollectionManager(Configuration) - Constructor for class org.apache.nutch.collection.CollectionManager
-
- CollectionManager() - Constructor for class org.apache.nutch.collection.CollectionManager
-
Used for testing.
- CommandRunner - Class in org.apache.nutch.util
-
- CommandRunner() - Constructor for class org.apache.nutch.util.CommandRunner
-
- comment(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report an XML comment anywhere in the document.
- commit() - Method in interface org.apache.nutch.indexer.IndexWriter
-
- commit() - Method in class org.apache.nutch.indexer.IndexWriters
-
- commit() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- COMMIT_INDEX - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- COMMIT_INDEX - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
Deprecated.
- COMMIT_SIZE - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- COMMIT_SIZE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.CrawlDatum.Comparator
-
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
-
Compares two FloatWritables in decreasing order.
- compare(WritableComparable, WritableComparable) - Method in class org.apache.nutch.crawl.Generator.HashComparator
-
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.HashComparator
-
- compare(Object, Object) - Method in class org.apache.nutch.crawl.SignatureComparator
-
- compareTo(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Sort by decreasing score.
- compareTo(TrieStringMatcher.TrieNode) - Method in class org.apache.nutch.util.TrieStringMatcher.TrieNode
-
- conf - Variable in class org.apache.nutch.crawl.Signature
-
- conf - Variable in class org.apache.nutch.plugin.Plugin
-
- conf - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbFilter
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Generator.Selector
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.LinkDb
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.LinkDbFilter
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.URLPartitioner
-
- configure(JobConf) - Method in class org.apache.nutch.fetcher.Fetcher
-
- configure(JobConf) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- configure(JobConf) - Method in class org.apache.nutch.parse.ParseSegment
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
Configures the job, sets the flag for type of content and the topN number
if any.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
Configures the OutlinkDb job.
- configure(JobConf) - Method in class org.apache.nutch.segment.SegmentMerger
-
- configure(JobConf) - Method in class org.apache.nutch.segment.SegmentReader
-
- configure(JobConf) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.tools.CrawlDBScanner
-
- configure(JobConf) - Method in class org.apache.nutch.tools.FreeGenerator.FG
-
- containsKey(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- containsValue(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- Content - Class in org.apache.nutch.protocol
-
- Content() - Constructor for class org.apache.nutch.protocol.Content
-
- Content(String, String, byte[], String, Metadata, Configuration) - Constructor for class org.apache.nutch.protocol.Content
-
- CONTENT_DISPOSITION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_LANGUAGE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_LENGTH - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_MD5 - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- CONTENT_REDIR - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- CONTENT_TYPE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- ContentAsTextInputFormat - Class in org.apache.nutch.segment
-
An input format that takes Nutch Content objects and converts them to text
while converting newline endings to spaces.
- ContentAsTextInputFormat() - Constructor for class org.apache.nutch.segment.ContentAsTextInputFormat
-
- CONTRIBUTOR - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity responsible for making contributions to the content of the
resource.
- COVERAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The extent or scope of the content of the resource.
- Crawl - Class in org.apache.nutch.crawl
-
- Crawl() - Constructor for class org.apache.nutch.crawl.Crawl
-
- CrawlDatum - Class in org.apache.nutch.crawl
-
- CrawlDatum() - Constructor for class org.apache.nutch.crawl.CrawlDatum
-
- CrawlDatum(int, int) - Constructor for class org.apache.nutch.crawl.CrawlDatum
-
- CrawlDatum(int, int, float) - Constructor for class org.apache.nutch.crawl.CrawlDatum
-
- CrawlDatum.Comparator - Class in org.apache.nutch.crawl
-
A Comparator optimized for CrawlDatum.
- CrawlDatum.Comparator() - Constructor for class org.apache.nutch.crawl.CrawlDatum.Comparator
-
- CrawlDb - Class in org.apache.nutch.crawl
-
This class takes the output of the fetcher and updates the
crawldb accordingly.
- CrawlDb() - Constructor for class org.apache.nutch.crawl.CrawlDb
-
- CrawlDb(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDb
-
- CRAWLDB_ADDITIONS_ALLOWED - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- CRAWLDB_PURGE_404 - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- CrawlDbFilter - Class in org.apache.nutch.crawl
-
This class provides a way to separate the URL normalization
and filtering steps from the rest of CrawlDb manipulation code.
- CrawlDbFilter() - Constructor for class org.apache.nutch.crawl.CrawlDbFilter
-
- CrawlDbMerger - Class in org.apache.nutch.crawl
-
This tool merges several CrawlDb-s into one, optionally filtering
URLs through the current URLFilters, to skip prohibited
pages.
- CrawlDbMerger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
-
- CrawlDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
-
- CrawlDbMerger.Merger - Class in org.apache.nutch.crawl
-
- CrawlDbMerger.Merger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- CrawlDbReader - Class in org.apache.nutch.crawl
-
Read utility for the CrawlDB.
- CrawlDbReader() - Constructor for class org.apache.nutch.crawl.CrawlDbReader
-
- CrawlDbReader.CrawlDatumCsvOutputFormat - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDatumCsvOutputFormat() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
-
- CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter(DataOutputStream) - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
-
- CrawlDbReader.CrawlDbDumpMapper - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbDumpMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- CrawlDbReader.CrawlDbStatCombiner - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbStatCombiner() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- CrawlDbReader.CrawlDbStatMapper - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbStatMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- CrawlDbReader.CrawlDbStatReducer - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbStatReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- CrawlDbReader.CrawlDbTopNMapper - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbTopNMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- CrawlDbReader.CrawlDbTopNReducer - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbTopNReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- CrawlDbReducer - Class in org.apache.nutch.crawl
-
Merge new page entries with existing entries.
- CrawlDbReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReducer
-
- CrawlDBScanner - Class in org.apache.nutch.tools
-
Dumps all the entries matching a regular expression on their URL.
- CrawlDBScanner() - Constructor for class org.apache.nutch.tools.CrawlDBScanner
-
- CrawlDBScanner(Configuration) - Constructor for class org.apache.nutch.tools.CrawlDBScanner
-
- create() - Static method in class org.apache.nutch.util.NutchConfiguration
-
Creates a Configuration for Nutch.
- create(boolean, Properties) - Static method in class org.apache.nutch.util.NutchConfiguration
-
Creates a Configuration from supplied properties.
- createJob(Configuration, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
-
- createKey() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Creates a new instance of the Text object for the key.
- createLockFile(FileSystem, Path, boolean) - Static method in class org.apache.nutch.util.LockUtil
-
Create a lock file.
- createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
-
- createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.LinkDbMerger
-
- createParseResult(String, Parse) - Static method in class org.apache.nutch.parse.ParseResult
-
Convenience method for obtaining a ParseResult from a single Parse output.
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- createSegments(Path, Path) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Creates the job that converts arc files to segments.
- createSocket(String, int, InetAddress, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
- createSocket(String, int, InetAddress, int, HttpConnectionParams) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
Attempts to get a new socket connection to the given host within the given
time limit.
- createSocket(String, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
- createSocket(Socket, String, int, boolean) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
- createSubCollection(String, String) - Method in class org.apache.nutch.collection.CollectionManager
-
Create a new subcollection.
- createValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Creates a new instance of the BytesWritable object for the value.
- createWebGraph(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
-
Creates the three different WebGraph databases, Outlinks, Inlinks, and
Node.
- CreativeCommons - Interface in org.apache.nutch.metadata
-
A collection of Creative Commons property names.
- CREATOR - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity primarily responsible for making the content of the resource.
- CURRENT_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- CURRENT_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
-
- DATE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A date associated with an event in the life cycle of the resource.
- dateFormatStr - Static variable in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- datum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
-
- debug - Variable in class org.apache.nutch.tools.proxy.AbstractTestbedHandler
-
- DEC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- dedup(String) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- dedup(String, boolean) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- DEFAULT_BOOST - Static variable in class org.apache.nutch.util.domain.DomainSuffix
-
- DEFAULT_DELAY - Static variable in class org.apache.nutch.tools.proxy.DelayHandler
-
- DEFAULT_FILE_NAME - Static variable in class org.apache.nutch.collection.CollectionManager
-
- DEFAULT_PLUGIN - Static variable in class org.apache.nutch.parse.ParserFactory
-
Wildcard for default plugins.
- DEFAULT_STATUS - Static variable in class org.apache.nutch.util.domain.DomainSuffix
-
- DefaultFetchSchedule - Class in org.apache.nutch.crawl
-
This class implements the default re-fetch schedule.
- DefaultFetchSchedule() - Constructor for class org.apache.nutch.crawl.DefaultFetchSchedule
-
- defaultInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
-
- deflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns a deflated copy of the input array.
- DeflateUtils - Class in org.apache.nutch.util
-
A collection of utility methods for working on deflated data.
- DeflateUtils() - Constructor for class org.apache.nutch.util.DeflateUtils
-
- DelayHandler - Class in org.apache.nutch.tools.proxy
-
- DelayHandler(int) - Constructor for class org.apache.nutch.tools.proxy.DelayHandler
-
- delete(String, boolean) - Method in class org.apache.nutch.indexer.CleaningJob
-
- delete(String) - Method in interface org.apache.nutch.indexer.IndexWriter
-
- delete(String) - Method in class org.apache.nutch.indexer.IndexWriters
-
- DELETE - Static variable in class org.apache.nutch.indexer.NutchIndexAction
-
- delete(String) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- deleteSubCollection(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Deletes the named subcollection.
- describe() - Method in interface org.apache.nutch.indexer.IndexWriter
-
Returns a String describing the IndexWriter instance and the specific parameters it can take.
- describe() - Method in class org.apache.nutch.indexer.IndexWriters
-
- describe() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- DESCRIPTION - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An account of the content of the resource.
- DIGEST_FIELD - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- DIR_NAME - Static variable in class org.apache.nutch.parse.ParseData
-
- DIR_NAME - Static variable in class org.apache.nutch.parse.ParseText
-
- DIR_NAME - Static variable in class org.apache.nutch.protocol.Content
-
- disconnect() - Method in class org.apache.nutch.protocol.ftp.Client
-
Closes the connection to the FTP server and restores
connection parameters to the default values.
- distributeScoreToOutlink(Text, Text, ParseData, CrawlDatum, CrawlDatum, int, int) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Gets the float value stored under Fetcher.SCORE_KEY, divides it by the number of outlinks, and applies the result.
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Distribute score value from the current page to all its outlinked pages.
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metatags listed in your "urlmeta.tags"
property and looks for them inside the parseData object.
- DmozParser - Class in org.apache.nutch.tools
-
Utility that converts DMOZ RDF into a flat file of URLs to be injected.
- DmozParser() - Constructor for class org.apache.nutch.tools.DmozParser
-
- doc - Variable in class org.apache.nutch.indexer.NutchIndexAction
-
- doFilter(ServletRequest, ServletResponse, FilterChain) - Method in class org.apache.nutch.tools.proxy.LogDebugHandler
-
- DomainBlacklistURLFilter - Class in org.apache.nutch.urlfilter.domainblacklist
-
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
- DomainBlacklistURLFilter() - Constructor for class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
Default constructor.
- DomainBlacklistURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
Constructor that specifies the domain file to use.
- DomainStatistics - Class in org.apache.nutch.util.domain
-
Extracts some very basic statistics about domains from the crawldb.
- DomainStatistics() - Constructor for class org.apache.nutch.util.domain.DomainStatistics
-
- DomainStatistics.DomainStatisticsCombiner - Class in org.apache.nutch.util.domain
-
- DomainStatistics.DomainStatisticsCombiner() - Constructor for class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
-
- DomainStatistics.MyCounter - Enum in org.apache.nutch.util.domain
-
- DomainSuffix - Class in org.apache.nutch.util.domain
-
This class represents the last part of the host name,
which is operated by authorities, not individuals.
- DomainSuffix(String, DomainSuffix.Status, float) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
-
- DomainSuffix(String) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
-
- DomainSuffix.Status - Enum in org.apache.nutch.util.domain
-
Enumeration of the status of the tld.
- DomainSuffixes - Class in org.apache.nutch.util.domain
-
Storage class for DomainSuffix objects.
Note: this class is a singleton.
- DomainURLFilter - Class in org.apache.nutch.urlfilter.domain
-
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
- DomainURLFilter() - Constructor for class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Default constructor.
- DomainURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Constructor that specifies the domain file to use.
- DOMBuilder - Class in org.apache.nutch.parse.html
-
This class takes SAX events (in addition to some extra events
that SAX doesn't handle yet) and adds the result to a document
or document fragment.
- DOMBuilder(Document, Node) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMBuilder(Document, DocumentFragment) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMBuilder(Document) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMContentUtils - Class in org.apache.nutch.parse.html
-
A collection of methods for extracting content from DOM trees.
- DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils
-
- DOMContentUtils - Class in org.apache.nutch.parse.tika
-
A collection of methods for extracting content from DOM trees.
- DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.tika.DOMContentUtils
-
- DOMContentUtils.LinkParams - Class in org.apache.nutch.parse.html
-
- DOMContentUtils.LinkParams(String, String, int) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
-
- DomUtil - Class in org.apache.nutch.util
-
- DomUtil() - Constructor for class org.apache.nutch.util.DomUtil
-
- DublinCore - Interface in org.apache.nutch.metadata
-
A collection of Dublin Core metadata names.
- DummySSLProtocolSocketFactory - Class in org.apache.nutch.protocol.httpclient
-
- DummySSLProtocolSocketFactory() - Constructor for class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
Constructor for DummySSLProtocolSocketFactory.
- DummyX509TrustManager - Class in org.apache.nutch.protocol.httpclient
-
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- dump(Path, Path) - Method in class org.apache.nutch.segment.SegmentReader
-
- DUMP_DIR - Static variable in class org.apache.nutch.scoring.webgraph.LinkDumper
-
- dumpLinks(Path) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
Runs the inverter and merger jobs of the LinkDumper tool to create the
url to inlink node database.
- dumpNodes(Path, NodeDumper.DumpType, long, Path, boolean, NodeDumper.NameType, NodeDumper.AggrType, boolean) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
Runs the process to dump the top urls out to a text file.
- dumpUrl(Path, String) - Method in class org.apache.nutch.scoring.webgraph.LoopReader
-
Prints loopset for a single url.
- dumpUrl(Path, String) - Method in class org.apache.nutch.scoring.webgraph.NodeReader
-
Prints the content of the Node represented by the url to system out.
- FAILED - Static variable in class org.apache.nutch.parse.ParseStatus
-
General failure.
- FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Content was not retrieved.
- FAILED_EXCEPTION - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_INVALID_FORMAT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_MISSING_CONTENT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_MISSING_PARTS - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FakeHandler - Class in org.apache.nutch.tools.proxy
-
- FakeHandler() - Constructor for class org.apache.nutch.tools.proxy.FakeHandler
-
- Feed - Interface in org.apache.nutch.metadata
-
A collection of Feed property names extracted by the ROME library.
- FEED - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_AUTHOR - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_PUBLISHED - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_TAGS - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_UPDATED - Static variable in interface org.apache.nutch.metadata.Feed
-
- FeedIndexingFilter - Class in org.apache.nutch.indexer.feed
-
- FeedIndexingFilter() - Constructor for class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- FeedParser - Class in org.apache.nutch.parse.feed
-
- FeedParser() - Constructor for class org.apache.nutch.parse.feed.FeedParser
-
- fetch(Path, int) - Method in class org.apache.nutch.fetcher.Fetcher
-
- fetch(Path, int) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- FETCH_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- FETCH_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- FETCH_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- fetched - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- Fetcher - Class in org.apache.nutch.fetcher
-
A queue-based fetcher.
- Fetcher() - Constructor for class org.apache.nutch.fetcher.Fetcher
-
- Fetcher(Configuration) - Constructor for class org.apache.nutch.fetcher.Fetcher
-
- Fetcher.InputFormat - Class in org.apache.nutch.fetcher
-
- Fetcher.InputFormat() - Constructor for class org.apache.nutch.fetcher.Fetcher.InputFormat
-
- FetcherOutputFormat - Class in org.apache.nutch.fetcher
-
Splits FetcherOutput entries into multiple map files.
- FetcherOutputFormat() - Constructor for class org.apache.nutch.fetcher.FetcherOutputFormat
-
- fetchErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- FetchSchedule - Interface in org.apache.nutch.crawl
-
This interface defines the contract for implementations that manipulate
fetch times and re-fetch intervals.
- FetchScheduleFactory - Class in org.apache.nutch.crawl
-
- FIELD - Static variable in class org.creativecommons.nutch.CCIndexingFilter
-
The name of the document field we use.
- fieldName - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
Doc field name
- File - Class in org.apache.nutch.protocol.file
-
This class is a protocol plugin used for file: scheme.
- File() - Constructor for class org.apache.nutch.protocol.file.File
-
- FileError - Exception in org.apache.nutch.protocol.file
-
Thrown for File error codes.
- FileError(int) - Constructor for exception org.apache.nutch.protocol.file.FileError
-
- FileException - Exception in org.apache.nutch.protocol.file
-
- FileException() - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- FileException(String) - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- FileException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- FileException(Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- fileLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- FileResponse - Class in org.apache.nutch.protocol.file
-
FileResponse.java mimics file replies as HTTP responses.
- FileResponse(URL, CrawlDatum, File, Configuration) - Constructor for class org.apache.nutch.protocol.file.FileResponse
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
Scans the HTML document looking for possible indications of content
language.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
- filter(String) - Method in class org.apache.nutch.collection.Subcollection
-
Simple "indexOf" currentFilter for matching patterns.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
The AnchorIndexingFilter filter object, which supports boolean
configuration settings for the deduplication of anchors.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
The BasicIndexingFilter filter object, which supports a few
configuration settings for adding basic searchable fields.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED,
FEED_UPDATED, FEED) and sends them to the Indexer for indexing within
the Nutch index.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in interface org.apache.nutch.indexer.IndexingFilter
-
Adds fields or otherwise modifies the document that will be indexed for a
parse.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.IndexingFilters
-
Run all defined filters.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Takes the metatags listed in your "urlmeta.tags" property and looks
for them inside the CrawlDatum object.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
Scans the HTML document looking for possible rel-tags.
- filter(String) - Method in interface org.apache.nutch.net.URLFilter
-
- filter(String) - Method in class org.apache.nutch.net.URLFilters
-
Run all defined filters.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in interface org.apache.nutch.parse.HtmlParseFilter
-
Adds metadata or otherwise modifies a parse of HTML content, given
the DOM tree of a page.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.HtmlParseFilters
-
Run all defined filters.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.MetaTagsParser
-
- filter() - Method in class org.apache.nutch.parse.ParseResult
-
Remove all results where status is not successful (as determined
by ParseStatus#isSuccess()).
- filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in interface org.apache.nutch.segment.SegmentMergeFilter
-
The filtering method which gets all information being merged for a given
key (URL).
- filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in class org.apache.nutch.segment.SegmentMergeFilters
-
Iterates over all SegmentMergeFilter extensions; if any of them
returns false, it returns false as well.
- filter(String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- filter(String) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.creativecommons.nutch.CCIndexingFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.creativecommons.nutch.CCParseFilter
-
Adds metadata or otherwise modifies a parse of an HTML document, given
the DOM tree of a page.
- filterNormalize(String, String, String, boolean, URLFilters, URLNormalizers) - Static method in class org.apache.nutch.parse.ParseOutputFormat
-
- finalize() - Method in class org.apache.nutch.plugin.Plugin
-
- finalize() - Method in class org.apache.nutch.plugin.PluginRepository
-
- finalize() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
- findAuthentication(Metadata) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- findLoops(Path) - Method in class org.apache.nutch.scoring.webgraph.Loops
-
Runs the various loop jobs.
- FIXED_INTERVAL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Used by AdaptiveFetchSchedule to maintain custom fetch interval
- FORBID_ALL_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt
file is not fetched due to a 403/Forbidden response; all requests
are disallowed.
- forceRefetch(Text, CrawlDatum, boolean) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method resets fetchTime, fetchInterval, modifiedTime,
retriesSinceFetch and page signature, so that it forces refetching.
- forceRefetch(Text, CrawlDatum, boolean) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method resets fetchTime, fetchInterval, modifiedTime and
page signature, so that it forces refetching.
- FORMAT - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Typically, Format may include the media-type or dimensions of the
resource.
- format - Static variable in class org.apache.nutch.net.protocols.HttpDateFormat
-
- forName(String) - Method in class org.apache.nutch.util.MimeUtil
-
A facade interface to Tika's underlying MimeTypes.forName(String)
method.
- FreeGenerator - Class in org.apache.nutch.tools
-
This tool generates fetchlists (segments to be fetched) from plain text
files containing one URL per line.
- FreeGenerator() - Constructor for class org.apache.nutch.tools.FreeGenerator
-
- FreeGenerator.FG - Class in org.apache.nutch.tools
-
- FreeGenerator.FG() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG
-
- fromHexString(String) - Static method in class org.apache.nutch.util.StringUtil
-
Convert a String containing consecutive (no inside whitespace) hexadecimal
digits into a corresponding byte array.
- FSUtils - Class in org.apache.nutch.util
-
Utility methods for common filesystem operations.
- FSUtils() - Constructor for class org.apache.nutch.util.FSUtils
-
- Ftp - Class in org.apache.nutch.protocol.ftp
-
This class is a protocol plugin used for ftp: scheme.
- Ftp() - Constructor for class org.apache.nutch.protocol.ftp.Ftp
-
- FtpError - Exception in org.apache.nutch.protocol.ftp
-
Thrown for Ftp error codes.
- FtpError(int) - Constructor for exception org.apache.nutch.protocol.ftp.FtpError
-
- FtpException - Exception in org.apache.nutch.protocol.ftp
-
Superclass for important exceptions thrown during FTP talk,
that must be handled with care.
- FtpException() - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpException(String) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpException(Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpExceptionBadSystResponse - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating bad reply of SYST command.
- FtpExceptionCanNotHaveDataConnection - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating failure of opening data connection.
- FtpExceptionControlClosedByForcedDataClose - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating control channel is closed by server end, due to
forced closure of data channel at client (our) end.
- FtpExceptionUnknownForcedDataClose - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating unrecognizable reply from server after
forced closure of data channel by client (our) side.
- FtpResponse - Class in org.apache.nutch.protocol.ftp
-
FtpResponse.java mimics FTP replies as HTTP responses.
- FtpResponse(URL, CrawlDatum, Ftp, Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpResponse
-
- FtpRobotRulesParser - Class in org.apache.nutch.protocol.ftp
-
This class is used for parsing robots rules for URLs belonging to the FTP protocol.
- FtpRobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
- generate(Path, Path, int, long, long) - Method in class org.apache.nutch.crawl.Generator
-
- generate(Path, Path, int, long, long, boolean, boolean) - Method in class org.apache.nutch.crawl.Generator
-
Old signature kept for compatibility: it does not specify whether or not to
normalise, and it sets the number of segments to 1.
- generate(Path, Path, int, long, long, boolean, boolean, boolean, int) - Method in class org.apache.nutch.crawl.Generator
-
Generate fetchlists in one or more segments.
- GENERATE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- GENERATE_MAX_PER_HOST_BY_IP - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATE_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- GENERATE_UPDATE_CRAWLDB - Static variable in class org.apache.nutch.crawl.Generator
-
- generated - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- generateFileNameForKeyValue(FloatWritable, Generator.SelectorEntry, String) - Method in class org.apache.nutch.crawl.Generator.GeneratorOutputFormat
-
- generateSegmentName() - Static method in class org.apache.nutch.crawl.Generator
-
- generateSegmentName() - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Generates a random name for the segments.
- Generator - Class in org.apache.nutch.crawl
-
Generates a subset of a crawl db to fetch.
- Generator() - Constructor for class org.apache.nutch.crawl.Generator
-
- Generator(Configuration) - Constructor for class org.apache.nutch.crawl.Generator
-
- Generator.CrawlDbUpdater - Class in org.apache.nutch.crawl
-
Update the CrawlDB so that the next generate won't include the same URLs.
- Generator.CrawlDbUpdater() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- Generator.DecreasingFloatComparator - Class in org.apache.nutch.crawl
-
- Generator.DecreasingFloatComparator() - Constructor for class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
-
- Generator.GeneratorOutputFormat - Class in org.apache.nutch.crawl
-
- Generator.GeneratorOutputFormat() - Constructor for class org.apache.nutch.crawl.Generator.GeneratorOutputFormat
-
- Generator.HashComparator - Class in org.apache.nutch.crawl
-
Sort fetch lists by hash of URL.
- Generator.HashComparator() - Constructor for class org.apache.nutch.crawl.Generator.HashComparator
-
- Generator.PartitionReducer - Class in org.apache.nutch.crawl
-
- Generator.PartitionReducer() - Constructor for class org.apache.nutch.crawl.Generator.PartitionReducer
-
- Generator.Selector - Class in org.apache.nutch.crawl
-
Selects entries due for fetch.
- Generator.Selector() - Constructor for class org.apache.nutch.crawl.Generator.Selector
-
- Generator.SelectorEntry - Class in org.apache.nutch.crawl
-
- Generator.SelectorEntry() - Constructor for class org.apache.nutch.crawl.Generator.SelectorEntry
-
- Generator.SelectorInverseMapper - Class in org.apache.nutch.crawl
-
- Generator.SelectorInverseMapper() - Constructor for class org.apache.nutch.crawl.Generator.SelectorInverseMapper
-
- GENERATOR_COUNT_MODE - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_COUNT_VALUE_DOMAIN - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_COUNT_VALUE_HOST - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_CUR_TIME - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_DELAY - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_FILTER - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MAX_COUNT - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MAX_NUM_SEGMENTS - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MIN_INTERVAL - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MIN_SCORE - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_NORMALISE - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_RESTRICT_STATUS - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_TOP_N - Static variable in class org.apache.nutch.crawl.Generator
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method prepares a sort value for the purpose of sorting and
selecting top N scoring pages during fetchlist generation.
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a sort value for Generate.
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- GenericWritableConfigurable - Class in org.apache.nutch.util
-
A generic Writable wrapper that can inject Configuration into Configurables.
- GenericWritableConfigurable() - Constructor for class org.apache.nutch.util.GenericWritableConfigurable
-
- get(String, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- get(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- get(String) - Method in class org.apache.nutch.metadata.Metadata
-
Get the value associated to a metadata name.
- get(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- get(String) - Method in class org.apache.nutch.parse.ParseResult
-
Retrieve a single parse output.
- get(Text) - Method in class org.apache.nutch.parse.ParseResult
-
Retrieve a single parse output.
- get(Configuration) - Static method in class org.apache.nutch.plugin.PluginRepository
-
- get(FileSplit) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a FileSplit.
- get(String) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a full path of a location inside any segment part.
- get(Path, Text, Writer, Map<String, List<Writable>>) - Method in class org.apache.nutch.segment.SegmentReader
-
- get(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
-
Returns the DomainSuffix object for the extension; if the extension
is a top-level domain, the returned object will be an instance of
TopLevelDomain.
- get(Configuration) - Static method in class org.apache.nutch.util.ObjectCache
-
- getAccept() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- getAcceptLanguage() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
Value of "Accept-Language" request header sent by Nutch.
- getAll() - Method in class org.apache.nutch.collection.CollectionManager
-
Returns all collections
- getAnchor() - Method in class org.apache.nutch.crawl.Inlink
-
- getAnchor() - Method in class org.apache.nutch.parse.Outlink
-
- getAnchor() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getAnchors() - Method in class org.apache.nutch.crawl.Inlinks
-
Return the set of anchor texts.
- getAnchors(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- getArgs() - Method in class org.apache.nutch.parse.ParseStatus
-
- getArgs() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getAttribute(String) - Method in class org.apache.nutch.plugin.Extension
-
Returns an attribute value that is set up in the manifest file and
defined by the extension point XML schema.
- getAuthentication(String, Configuration) - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
This method is responsible for providing Basic authentication information.
- getBase(Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
If Node contains a BASE tag, then its HREF is returned.
- getBaseHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getBaseUrl() - Method in class org.apache.nutch.protocol.Content
-
The base url for relative links contained in the content.
- getBasicPattern() - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Provides a pattern which can be used by an outside resource to determine if
this class can provide credentials based on simple header information.
- getBlackListString() - Method in class org.apache.nutch.collection.Subcollection
-
Returns blacklist String
- getBoost() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- getBoost() - Method in class org.apache.nutch.util.domain.DomainSuffix
-
- getBufferSize() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
- getClassLoader() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns a cached classloader for a plugin.
- getClazz() - Method in class org.apache.nutch.plugin.Extension
-
Returns the full class name of the extension point implementation
- getCode() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the response code.
- getCode(int) - Method in exception org.apache.nutch.protocol.file.FileError
-
- getCode() - Method in class org.apache.nutch.protocol.file.FileResponse
-
Returns the response code.
- getCode(int) - Method in exception org.apache.nutch.protocol.ftp.FtpError
-
- getCode() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
Returns the response code.
- getCode() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getCode() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getCode() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getCollectionManager(Configuration) - Static method in class org.apache.nutch.collection.CollectionManager
-
- getCommand() - Method in class org.apache.nutch.util.CommandRunner
-
- getCommonsHttpSolrServer(JobConf) - Static method in class org.apache.nutch.indexer.solr.SolrUtils
-
- getCommonsHttpSolrServer(JobConf) - Static method in class org.apache.nutch.indexwriter.solr.SolrUtils
-
- getConf() - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- getConf() - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
- getConf() - Method in class org.apache.nutch.crawl.Signature
-
- getConf() - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.indexer.CleaningJob
-
- getConf() - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- getConf() - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- getConf() - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- getConf() - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Boilerplate
- getConf() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
-
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
-
- getConf() - Method in class org.apache.nutch.parse.ext.ExtParser
-
- getConf() - Method in class org.apache.nutch.parse.feed.FeedParser
-
- getConf() - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
- getConf() - Method in class org.apache.nutch.parse.html.HtmlParser
-
- getConf() - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- getConf() - Method in class org.apache.nutch.parse.MetaTagsParser
-
- getConf() - Method in class org.apache.nutch.parse.ParserChecker
-
- getConf() - Method in class org.apache.nutch.parse.swf.SWFParser
-
- getConf() - Method in class org.apache.nutch.parse.tika.TikaParser
-
- getConf() - Method in class org.apache.nutch.parse.zip.ZipParser
-
- getConf() - Method in class org.apache.nutch.protocol.file.File
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
- getConf() - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- getConf() - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- getConf() - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
-
- getConf() - Method in class org.apache.nutch.util.GenericWritableConfigurable
-
- getConf() - Method in class org.creativecommons.nutch.CCIndexingFilter
-
- getConf() - Method in class org.creativecommons.nutch.CCParseFilter
-
- getContent() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the full content of the response.
- getContent() - Method in class org.apache.nutch.protocol.Content
-
The binary content retrieved.
- getContent() - Method in class org.apache.nutch.protocol.file.FileResponse
-
- getContent() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
- getContent() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getContent() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getContent() - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- getContentMeta() - Method in class org.apache.nutch.parse.ParseData
-
The original Metadata retrieved from content
- getContentType() - Method in exception org.apache.nutch.parse.ParserNotFound
-
- getContentType() - Method in class org.apache.nutch.protocol.Content
-
The media type of the retrieved content.
- getCopyMap() - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getCountryName() - Method in class org.apache.nutch.util.domain.TopLevelDomain
-
Returns the country name if the TLD is a country-code TLD.
- getCrawlDelay() - Method in interface org.apache.nutch.protocol.RobotRules
-
Get Crawl-Delay, in milliseconds.
- getCredentials() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
-
Gets the credentials generated by the HttpAuthentication object.
- getCredentials() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Gets the Basic credentials generated by this HttpBasicAuthentication object.
- getCurrentNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Get the node currently being processed.
- getData() - Method in interface org.apache.nutch.parse.Parse
-
Other data extracted from the page.
- getData() - Method in class org.apache.nutch.parse.ParseImpl
-
- getDependencies() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of plugin ids.
- getDescriptor() - Method in class org.apache.nutch.plugin.Extension
-
Returns the plugin descriptor.
- getDescriptor() - Method in class org.apache.nutch.plugin.Plugin
-
Returns the plugin descriptor
- getDocBegin() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- getDocumentMeta() - Method in class org.apache.nutch.indexer.NutchDocument
-
- getDom(InputStream) - Static method in class org.apache.nutch.util.DomUtil
-
Returns the parsed DOM tree, or null if any error occurs.
- getDomain() - Method in class org.apache.nutch.util.domain.DomainSuffix
-
- getDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the domain name of the url.
- getDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the domain name of the url.
- getDomainSuffix(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the DomainSuffix corresponding to the last public part of the hostname.
- getDomainSuffix(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the DomainSuffix corresponding to the last public part of the hostname.
- getElement(String) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Finds the specified element and returns its value
- getEmptyParse(Configuration) - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- getEmptyParseResult(String, Configuration) - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- getExitValue() - Method in class org.apache.nutch.util.CommandRunner
-
- getExpireTime() - Method in interface org.apache.nutch.protocol.RobotRules
-
Get expire time
- getExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of exported libraries as URLs.
- getExtensionInstance() - Method in class org.apache.nutch.plugin.Extension
-
Returns an instance of the extension implementation.
- getExtensionPoint(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns an extension point identified by an extension point id.
- getExtensions(String) - Method in class org.apache.nutch.parse.ParserFactory
-
Finds the best-suited parse plugin for a given contentType.
- getExtensions() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns an array of extensions that listen to this extension point.
- getExtensions() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of extensions.
- getExtenstionPoints() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of extension points.
- getFetchInterval() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getFetchSchedule(Configuration) - Static method in class org.apache.nutch.crawl.FetchScheduleFactory
-
Return the FetchSchedule implementation.
- getFetchTime() - Method in class org.apache.nutch.crawl.CrawlDatum
-
Returns either the time of the last fetch, or the next fetch time,
depending on whether Fetcher or CrawlDbReducer set the time.
- getField(String) - Method in class org.apache.nutch.indexer.NutchDocument
-
- getFieldNames() - Method in class org.apache.nutch.indexer.NutchDocument
-
- getFieldValue(String) - Method in class org.apache.nutch.indexer.NutchDocument
-
- getFromUrl() - Method in class org.apache.nutch.crawl.Inlink
-
- getGeneralTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Returns all collected values of the general meta tags.
- getHeader(String) - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.file.FileResponse
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getHeader(String) - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getHeaders() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns all the headers.
- getHeaders() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getHeaders() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getHost(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the lowercased hostname for the url or null if the url is not well
formed.
- getHostSegments(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Partitions the hostname of the url by ".".
- getHostSegments(String) - Static method in class org.apache.nutch.util.URLUtil
-
Partitions the hostname of the url by ".".
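As a rough illustration of what partitioning a hostname by "." means, here is a minimal sketch that uses only the JDK. The class name HostSegmentsSketch and its method are hypothetical and are not part of Nutch; the actual URLUtil.getHostSegments may treat special cases (for example IP addresses) differently.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper for illustration only; not the Nutch implementation.
final class HostSegmentsSketch {

    // Splits the host part of a URL at each "." and returns the pieces.
    // Returns an empty array when the URL is malformed or has no host.
    static String[] hostSegments(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return new String[0]; // not a well-formed URL
        }
        if (host == null || host.isEmpty()) {
            return new String[0];
        }
        // "." is a regex metacharacter, so it must be escaped for split().
        return host.split("\\.");
    }

    public static void main(String[] args) {
        // prints [lucene, apache, org]
        System.out.println(Arrays.toString(hostSegments("http://lucene.apache.org/nutch/")));
    }
}
```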
- getHttpEquivTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Returns all collected values of the "http-equiv" meta tags.
- getId() - Method in class org.apache.nutch.collection.Subcollection
-
- getId() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- getId() - Method in class org.apache.nutch.plugin.Extension
-
Return the unique id of the extension.
- getId() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns the unique id of the extension point.
- getInlinks(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- getInlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getInstance(Configuration) - Static method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getInstance() - Static method in class org.apache.nutch.util.domain.DomainSuffixes
-
Singleton instance, lazy instantiation
- getKey() - Method in class org.apache.nutch.collection.Subcollection
-
- getKeyMap() - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getLastModified() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getLength() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- getLinks() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- getLinkType() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getLocations() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- getLookingFor() - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- getLoopSet() - Method in class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- getMajorCode() - Method in class org.apache.nutch.parse.ParseStatus
-
- getMaxContent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getMessage() - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- getMessage() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getMeta(String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get metadata.
- getMeta(String) - Method in class org.apache.nutch.parse.ParseData
-
Get a metadata single value.
- getMetaData() - Method in class org.apache.nutch.crawl.CrawlDatum
-
Returns a MapWritable if it was set or read in readFields(DataInput); returns an empty map if the CrawlDatum was freshly created (lazily instantiated).
- getMetadata() - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get all metadata.
- getMetadata() - Method in class org.apache.nutch.protocol.Content
-
Other protocol-specific data.
- getMetadata() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.html.HTMLMetaProcessor
-
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
- getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.tika.HTMLMetaProcessor
-
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
- getMetaValues(String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get multiple metadata.
- getMimeType(String) - Method in class org.apache.nutch.util.MimeUtil
-
Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
- getMimeType(File) - Method in class org.apache.nutch.util.MimeUtil
-
Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.
- getMinorCode() - Method in class org.apache.nutch.parse.ParseStatus
-
- getModifiedTime() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getName() - Method in class org.apache.nutch.collection.Subcollection
-
- getName() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns the name of the extension point.
- getName() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the name of the plugin.
- getName() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getNode() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- getNodeValue(Node) - Static method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Returns the text value of the specified Node and child nodes
- getNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getNormalizedName(String) - Static method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
Get the normalized name of metadata attribute name.
- getNotExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of libraries as URLs that are not exported by the plugin.
- getNumDocs() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- getNumInlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getNumOutlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getObject(String) - Method in class org.apache.nutch.util.ObjectCache
-
- getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
- getOutlinks(String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
-
Extracts Outlink from given plain text.
- getOutlinks(String, String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
-
Extracts Outlink from given plain text and adds anchor to the extracted Outlinks.
- getOutlinks() - Method in class org.apache.nutch.parse.ParseData
-
The outlinks of the page.
- getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
- getOutlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getOutlinkUrl() - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- getPage(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the page for the url.
- getParse(Content) - Method in class org.apache.nutch.parse.ext.ExtParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.feed.FeedParser
-
Parses the given feed and extracts and parses all linked items within the feed, using the underlying ROME feed parsing library.
- getParse(Content) - Method in class org.apache.nutch.parse.html.HtmlParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- getParse(Content) - Method in interface org.apache.nutch.parse.Parser
-
This method parses the given content and returns a map of <key, parse> pairs.
- getParse(Content) - Method in class org.apache.nutch.parse.swf.SWFParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.tika.TikaParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.zip.ZipParser
-
- getParseMeta() - Method in class org.apache.nutch.parse.ParseData
-
Other content properties.
- getParserById(String) - Method in class org.apache.nutch.parse.ParserFactory
-
Function returns a Parser instance with the specified extId, representing its extension ID.
- getParsers(String, String) - Method in class org.apache.nutch.parse.ParserFactory
-
Function returns an array of Parsers for a given content type.
- getPartition(FloatWritable, Writable, int) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Partition by host / domain or IP.
- getPartition(Text, Writable, int) - Method in class org.apache.nutch.crawl.URLPartitioner
-
Hash by domain name.
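The usual way a partitioner turns a key into a partition number is to hash it, clear the sign bit, and take the remainder modulo the number of partitions. The sketch below shows that idiom with JDK classes only. HostPartitionSketch is a hypothetical name, and the sketch hashes the full host rather than the registered domain, so it is a simplified stand-in and not the URLPartitioner code.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration of hash partitioning; not the Nutch implementation.
final class HostPartitionSketch {

    // Maps a URL to a partition in [0, numPartitions) based on its host,
    // so that all URLs from the same host land in the same partition.
    static int partition(String url, int numPartitions) {
        String key = url; // fall back to the whole URL if no host can be extracted
        try {
            String host = URI.create(url).getHost();
            if (host != null) {
                key = host.toLowerCase(Locale.ROOT);
            }
        } catch (IllegalArgumentException e) {
            // malformed URL: keep the fallback key
        }
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = partition("http://example.org/a.html", 8);
        int b = partition("http://EXAMPLE.org/b.html", 8);
        System.out.println(a == b); // same host, same partition
    }
}
```

Keeping every URL of a host in one partition is what lets a single task enforce per-host politeness limits.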
- getPassAllFilter() - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Returns PathFilter that passes all paths through.
- getPassDirectoriesFilter(FileSystem) - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Returns PathFilter that passes directories through.
- getPaths(FileStatus[]) - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Turns an array of FileStatus into an array of Paths.
- getPluginClass() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the fully qualified name of the class which implements the abstract Plugin class.
- getPluginDescriptor(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns the descriptor of one plugin identified by a plugin id.
- getPluginDescriptors() - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns all registered plugin descriptors.
- getPluginFolder(String) - Method in class org.apache.nutch.plugin.PluginManifestParser
-
Return the named plugin folder.
- getPluginId() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the unique identifier of the plug-in or null.
- getPluginInstance(PluginDescriptor) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns an instance of a plugin.
- getPluginPath() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the directory path of the plugin.
- getPos() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns the current position in the file.
- getProgress() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns the percentage of progress in processing the file.
- getProtocol(String) - Method in class org.apache.nutch.protocol.ProtocolFactory
-
Returns the appropriate Protocol implementation for a url.
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.file.File
-
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getProtocolOutput(Text, CrawlDatum) - Method in interface org.apache.nutch.protocol.Protocol
-
Returns the Content for a fetchlist entry.
- getProviderName() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
- getProxyHost() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getProxyPort() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getRealm() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
-
Gets the realm used by the HttpAuthentication object during creation.
- getRealm() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Gets the realm attribute of the HttpBasicAuthentication object.
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputFormat
-
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.segment.ContentAsTextInputFormat
-
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
-
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.tools.arc.ArcInputFormat
-
Returns the RecordReader for reading the arc file.
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.indexer.IndexerOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.parse.ParseOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.segment.SegmentReader.TextOutputFormat
-
- getRefresh() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getRefreshHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getRefreshTime() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getResourceString(String, Locale) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an I18N'd resource string.
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.Http
-
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.httpclient.Http
-
Fetches the url with a configured HTTP client and gets the response.
- getRetriesSinceFetch() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getRobotRules(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.file.File
-
No robots parsing is done for file protocol.
- getRobotRules(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Get the robots rules for a given url
- getRobotRules(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getRobotRules(Text, CrawlDatum) - Method in interface org.apache.nutch.protocol.Protocol
-
Retrieve robot rules applicable for this url.
- getRobotRulesSet(Protocol, URL) - Method in class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
For hosts whose robots rules are not yet cached, sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules, and caches the rules object to avoid re-work in the future.
- getRobotRulesSet(Protocol, URL) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
For hosts whose robots rules are not yet cached, sends an HTTP request to the host corresponding to the URL passed, gets the robots file, parses the rules, and caches the rules object to avoid re-work in the future.
- getRobotRulesSet(Protocol, Text) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
- getRobotRulesSet(Protocol, URL) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
- getRootNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Get the root node of the DOM being created.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Returns the name of the file of rules to use for
a particular implementation.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
Rules specified as a config property will override rules specified
as a config file.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
Rules specified as a config property will override rules specified
as a config file.
- getRuns() - Method in class org.apache.nutch.tools.Benchmark.BenchmarkResults
-
- getSchema() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns a path to the XML schema of an extension point.
- getScopedRules() - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
- getScore() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getScore() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getSignature() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getSignature(Configuration) - Static method in class org.apache.nutch.crawl.SignatureFactory
-
Return the default Signature implementation.
- getSplits(JobConf, int) - Method in class org.apache.nutch.fetcher.Fetcher.InputFormat
-
Don't split inputs, to keep things polite.
- getSplits(JobConf, int) - Method in class org.apache.nutch.fetcher.OldFetcher.InputFormat
-
Don't split inputs, to keep things polite.
- getSplits(JobConf, int) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputFormat
-
Return each index as a split.
- getStages() - Method in class org.apache.nutch.tools.Benchmark.BenchmarkResults
-
- getStats(Path, SegmentReader.SegmentReaderStats) - Method in class org.apache.nutch.segment.SegmentReader
-
- getStatus() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getStatus() - Method in class org.apache.nutch.parse.ParseData
-
The status of parsing the page.
- getStatus() - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- getStatus() - Method in class org.apache.nutch.util.domain.DomainSuffix
-
- getStatusName(byte) - Static method in class org.apache.nutch.crawl.CrawlDatum
-
- getSubColection(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Returns named subcollection
- getSubCollections(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Returns the names of the collections the url is part of.
- getSystemName() - Method in class org.apache.nutch.protocol.ftp.Client
-
Fetches the system type name from the server and returns the string.
- getTargetPoint() - Method in class org.apache.nutch.plugin.Extension
-
Returns the id of the extension point that is implemented by this extension.
- getText(StringBuffer, Node, boolean) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer.
- getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
- getText() - Method in interface org.apache.nutch.parse.Parse
-
The textual content of the page.
- getText() - Method in class org.apache.nutch.parse.ParseImpl
-
- getText() - Method in class org.apache.nutch.parse.ParseText
-
- getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
- getThrownError() - Method in class org.apache.nutch.util.CommandRunner
-
- getTimeout() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getTimeout() - Method in class org.apache.nutch.util.CommandRunner
-
- getTimestamp() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
- getTitle() - Method in class org.apache.nutch.parse.ParseData
-
The title of the page.
- getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
- getTopLevelDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the top level domain name of the url.
- getTopLevelDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the top level domain name of the url.
- getToUrl() - Method in class org.apache.nutch.parse.Outlink
-
- getTstamp() - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- getType() - Method in class org.apache.nutch.util.domain.TopLevelDomain
-
- getTypes() - Method in class org.apache.nutch.crawl.NutchWritable
-
- getUniqueKey() - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getUrl() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the URL used to retrieve this response.
- getUrl() - Method in exception org.apache.nutch.parse.ParserNotFound
-
- getUrl() - Method in class org.apache.nutch.protocol.Content
-
The url fetched.
- getUrl() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getUrl() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getUrl() - Method in exception org.apache.nutch.protocol.ProtocolNotFound
-
- getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- getUseHttp11() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getUserAgent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getUUID(Configuration) - Static method in class org.apache.nutch.util.NutchConfiguration
-
Retrieve a Nutch UUID of this configuration object, or null
if the configuration was created elsewhere.
- getValues() - Method in class org.apache.nutch.indexer.NutchField
-
- getValues(String) - Method in class org.apache.nutch.metadata.Metadata
-
Get the values associated to a metadata name.
- getValues(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- getVersion() - Method in class org.apache.nutch.parse.ParseData
-
- getVersion() - Method in class org.apache.nutch.parse.ParseStatus
-
- getVersion() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
- getWaitForExit() - Method in class org.apache.nutch.util.CommandRunner
-
- getWeight() - Method in class org.apache.nutch.indexer.NutchDocument
-
- getWeight() - Method in class org.apache.nutch.indexer.NutchField
-
- getWhiteList() - Method in class org.apache.nutch.collection.Subcollection
-
Returns whitelist
- getWhiteListString() - Method in class org.apache.nutch.collection.Subcollection
-
Returns whitelist String
- getWriter() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Return null since there is no Writer for this class.
- GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource is gone.
- guessEncoding(Content, String) - Method in class org.apache.nutch.util.EncodingDetector
-
Guess the encoding with the previously specified list of clues.
- GZIPUtils - Class in org.apache.nutch.util
-
A collection of utility methods for working on GZIPed data.
- GZIPUtils() - Constructor for class org.apache.nutch.util.GZIPUtils
-
- ID_FIELD - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- IDENTIFIER - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Recommended best practice is to identify the resource by means of a
string or number conforming to a formal identification system.
- ignorableWhitespace(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of ignorable whitespace in element content.
- IGNORE_INTERNAL_LINKS - Static variable in class org.apache.nutch.crawl.LinkDb
-
- in - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- INC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- index(Path, Path, List<Path>) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean, boolean, String) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
-
- INDEXER_DELETE - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- INDEXER_DELETE_ROBOTS_NOINDEX - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- INDEXER_PARAMS - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- INDEXER_SKIP_NOTMODIFIED - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- IndexerMapReduce - Class in org.apache.nutch.indexer
-
- IndexerMapReduce() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce
-
- IndexerOutputFormat - Class in org.apache.nutch.indexer
-
- IndexerOutputFormat() - Constructor for class org.apache.nutch.indexer.IndexerOutputFormat
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Dampen the boost value by scorePower.
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method calculates a Lucene document boost.
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- IndexingException - Exception in org.apache.nutch.indexer
-
- IndexingException() - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingException(String) - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingException(String, Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingException(Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingFilter - Interface in org.apache.nutch.indexer
-
Extension point for indexing.
- INDEXINGFILTER_ORDER - Static variable in class org.apache.nutch.indexer.IndexingFilters
-
- IndexingFilters - Class in org.apache.nutch.indexer
-
- IndexingFilters(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingFilters
-
- IndexingFiltersChecker - Class in org.apache.nutch.indexer
-
Reads and parses a URL and runs the indexers on it.
- IndexingFiltersChecker() - Constructor for class org.apache.nutch.indexer.IndexingFiltersChecker
-
- IndexingJob - Class in org.apache.nutch.indexer
-
Generic indexer which relies on the plugins implementing IndexWriter
- IndexingJob() - Constructor for class org.apache.nutch.indexer.IndexingJob
-
- IndexingJob(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingJob
-
- IndexWriter - Interface in org.apache.nutch.indexer
-
- IndexWriters - Class in org.apache.nutch.indexer
-
- IndexWriters(Configuration) - Constructor for class org.apache.nutch.indexer.IndexWriters
-
- inflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array.
- inflateBestEffort(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array.
- inflateBestEffort(byte[], int) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array, truncated to sizeLimit bytes, if necessary.
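To make "inflated copy, truncated to sizeLimit bytes" concrete, the following sketch shows one way to do a best-effort, size-capped inflate with java.util.zip. The class InflateSketch and its deflate helper are hypothetical and exist only for this illustration. The sketch assumes zlib-wrapped data (the JDK default), which may not match the format DeflateUtils expects.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration; not the Nutch DeflateUtils implementation.
final class InflateSketch {

    // Inflates as much of the input as possible, stopping at sizeLimit bytes
    // or at the first sign of truncated or corrupt data.
    static byte[] inflateBestEffort(byte[] in, int sizeLimit) {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished() && out.size() < sizeLimit) {
                int want = Math.min(buf.length, sizeLimit - out.size());
                int n = inflater.inflate(buf, 0, want);
                if (n == 0) {
                    break; // input exhausted (truncated stream) or dictionary needed
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Best effort: keep whatever was inflated before the error.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Helper used only to produce test input for the sketch.
    static byte[] deflate(byte[] in) {
        Deflater deflater = new Deflater();
        deflater.setInput(in);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = deflate("hello hello hello".getBytes(StandardCharsets.UTF_8));
        byte[] head = inflateBestEffort(packed, 5);
        System.out.println(new String(head, StandardCharsets.UTF_8)); // prints hello
    }
}
```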
- init() - Method in class org.apache.nutch.collection.CollectionManager
-
- init(Path) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- init(FilterConfig) - Method in class org.apache.nutch.tools.proxy.LogDebugHandler
-
- initialize(Element) - Method in class org.apache.nutch.collection.Subcollection
-
Initializes the Subcollection from a DOM element.
- initializeSchedule(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Initialize fetch schedule related data.
- initializeSchedule(Text, CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Initialize fetch schedule related data.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Set to 0.0f (unknown value) - inlink contributions will bring it to
a correct level.
- initialScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Set an initial score for newly discovered pages.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a new initial score, used when adding newly discovered pages.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- initMRJob(Path, Path, Collection<Path>, JobConf) - Static method in class org.apache.nutch.indexer.IndexerMapReduce
-
- inject(Path, Path) - Method in class org.apache.nutch.crawl.Injector
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Set an initial score for newly injected pages.
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a new initial score, used when injecting new pages.
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- Injector - Class in org.apache.nutch.crawl
-
This class takes a flat file of URLs and adds them to the set of pages to be
crawled.
- Injector() - Constructor for class org.apache.nutch.crawl.Injector
-
- Injector(Configuration) - Constructor for class org.apache.nutch.crawl.Injector
-
- Injector.InjectMapper - Class in org.apache.nutch.crawl
-
Normalize and filter injected urls.
- Injector.InjectMapper() - Constructor for class org.apache.nutch.crawl.Injector.InjectMapper
-
- Injector.InjectReducer - Class in org.apache.nutch.crawl
-
Combine multiple new entries for a url.
- Injector.InjectReducer() - Constructor for class org.apache.nutch.crawl.Injector.InjectReducer
-
- Inlink - Class in org.apache.nutch.crawl
-
- Inlink() - Constructor for class org.apache.nutch.crawl.Inlink
-
- Inlink(String, String) - Constructor for class org.apache.nutch.crawl.Inlink
-
- INLINK - Static variable in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- INLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
-
- Inlinks - Class in org.apache.nutch.crawl
-
- Inlinks() - Constructor for class org.apache.nutch.crawl.Inlinks
-
- install(JobConf, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
-
- install(JobConf, Path) - Static method in class org.apache.nutch.crawl.LinkDb
-
- invert(Path, Path, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
-
- invert(Path, Path[], boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
-
- isAllowed(URL) - Method in interface org.apache.nutch.protocol.RobotRules
-
Returns false if the robots.txt file prohibits us from accessing the given url, or true otherwise.
- isCanonical() - Method in interface org.apache.nutch.parse.Parse
-
Indicates if the parse is coming from a url or a sub-url
- isCanonical() - Method in class org.apache.nutch.parse.ParseImpl
-
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- isDomainSuffix(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
-
Returns whether the extension is a registered domain entry.
- isEmpty() - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- isEmpty() - Method in class org.apache.nutch.parse.ParseResult
-
Checks whether the result is empty.
- isEmpty(String) - Static method in class org.apache.nutch.util.StringUtil
-
Checks if a string is empty (i.e. is null or empty).
- isFound() - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- isIgnoreCase() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- isMagic(byte[]) - Static method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns true if the byte array passed matches the gzip header magic
number.
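As a rough illustration of the check described above: the gzip format (RFC 1952) starts every member with the two bytes 0x1f 0x8b. The sketch below tests whether a byte array begins with them. The class name `GzipMagicSketch` is hypothetical, and the real ArcRecordReader method may compare the array differently (for example, by requiring an exact length).

```java
final class GzipMagicSketch {
    // Two-byte gzip magic number from RFC 1952.
    private static final byte[] MAGIC = {(byte) 0x1f, (byte) 0x8b};

    /** Returns true if input starts with the gzip magic number (assumed prefix semantics). */
    static boolean isMagic(byte[] input) {
        if (input == null || input.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (input[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isMagic(new byte[] {(byte) 0x1f, (byte) 0x8b, 0x08})); // prints "true"
        System.out.println(isMagic(new byte[] {0x50, 0x4b}));                     // prints "false"
    }
}
```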
- isModeAccept() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- isMultiValued(String) - Method in class org.apache.nutch.metadata.Metadata
-
Returns true if named value is multivalued.
- isParsing(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
-
- isParsing(Configuration) - Static method in class org.apache.nutch.fetcher.OldFetcher
-
- isPermanentFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- isRemoteVerificationEnabled() - Method in class org.apache.nutch.protocol.ftp.Client
-
Return whether or not verification of the remote host participating
in data connections is enabled.
- isSameDomainName(URL, URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns whether the given urls have the same domain name.
- isSameDomainName(String, String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns whether the given urls have the same domain name.
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- isStoringContent(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
-
- isStoringContent(Configuration) - Static method in class org.apache.nutch.fetcher.OldFetcher
-
- isSuccess() - Method in class org.apache.nutch.parse.ParseResult
-
A convenience method which returns true only if all parses are successful.
- isSuccess() - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- isSuccess() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- isTransientFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- isTruncated(Content) - Static method in class org.apache.nutch.parse.ParseSegment
-
Checks if the page's content is truncated.
- isWhiteSpace(char) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Returns whether the specified ch conforms to the XML 1.0 definition
of whitespace.
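For reference, XML 1.0 defines whitespace (production S) as exactly four characters: space (#x20), tab (#x9), carriage return (#xD) and line feed (#xA). A minimal sketch of such a test follows; the class name `XmlWhitespaceSketch` is hypothetical, and treating an empty string as whitespace in the string variant is an assumption of this sketch, not something the index states.

```java
final class XmlWhitespaceSketch {

    /** XML 1.0 production S: space, tab, carriage return, line feed. */
    static boolean isWhiteSpace(char ch) {
        return ch == 0x20 || ch == 0x09 || ch == 0x0D || ch == 0x0A;
    }

    /** True if every character passes the single-character test (vacuously true when empty). */
    static boolean isWhiteSpace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isWhiteSpace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWhiteSpace('\t'));     // prints "true"
        System.out.println(isWhiteSpace('\u00A0')); // prints "false": NBSP is not XML whitespace
    }
}
```

Note that this set is narrower than `Character.isWhitespace`, which also accepts characters such as form feed.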
- isWhiteSpace(char[], int, int) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- isWhiteSpace(StringBuffer) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- isWhiteSpace(String) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- iterator() - Method in class org.apache.nutch.crawl.Inlinks
-
- iterator() - Method in class org.apache.nutch.indexer.NutchDocument
-
Iterate over all fields.
- iterator() - Method in class org.apache.nutch.parse.ParseResult
-
Iterate over all entries in the <url, Parse> map.
- LANGUAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A language of the intellectual content of the resource.
- LanguageIndexingFilter - Class in org.apache.nutch.analysis.lang
-
- LanguageIndexingFilter() - Constructor for class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
Constructs a new Language Indexing Filter.
- LAST_MODIFIED - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- leftPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Returns a copy of s padded with leading spaces so that its length is length.
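A padding helper of this kind might look like the sketch below. It is not the StringUtil source; the class name `LeftPadSketch` is hypothetical, and returning the string unchanged when it is already at least `length` characters long is an assumption.

```java
final class LeftPadSketch {

    /** Prepends spaces until the result is at least `length` characters long. */
    static String leftPad(String s, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < length; i++) {
            sb.append(' ');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + leftPad("42", 5) + "]"); // prints "[   42]"
    }
}
```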
- LICENSE_LOCATION - Static variable in interface org.apache.nutch.metadata.CreativeCommons
-
- LICENSE_URL - Static variable in interface org.apache.nutch.metadata.CreativeCommons
-
- LinkAnalysisScoringFilter - Class in org.apache.nutch.scoring.link
-
- LinkAnalysisScoringFilter() - Constructor for class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- LinkDatum - Class in org.apache.nutch.scoring.webgraph
-
A class for holding link information including the url, anchor text, a score,
the timestamp of the link and a link type.
- LinkDatum() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Default constructor, no url, timestamp, score, or link type.
- LinkDatum(String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Creates a LinkDatum with a given url.
- LinkDatum(String, String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Creates a LinkDatum with a url and an anchor text.
- LinkDatum(String, String, long) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
- LinkDb - Class in org.apache.nutch.crawl
-
Maintains an inverted link map, listing incoming links for each url.
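The idea of an inverted link map can be shown in memory with plain collections. This is only a conceptual sketch: the class name `InvertLinksSketch` and the use of `Map<String, List<String>>` are hypothetical, whereas the signatures in this index suggest the real LinkDb runs as a map/reduce job over ParseData and emits Inlinks values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertLinksSketch {

    /** Turns "page -> pages it links to" into "page -> pages linking to it". */
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                // Record the source page as an incoming link of the target.
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> out = new TreeMap<>();
        out.put("a", Arrays.asList("b", "c"));
        out.put("b", Arrays.asList("c"));
        System.out.println(invert(out)); // prints "{b=[a], c=[a, b]}"
    }
}
```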
- LinkDb() - Constructor for class org.apache.nutch.crawl.LinkDb
-
- LinkDb(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDb
-
- LinkDbFilter - Class in org.apache.nutch.crawl
-
This class provides a way to separate the URL normalization
and filtering steps from the rest of LinkDb manipulation code.
- LinkDbFilter() - Constructor for class org.apache.nutch.crawl.LinkDbFilter
-
- LinkDbMerger - Class in org.apache.nutch.crawl
-
This tool merges several LinkDb-s into one, optionally filtering
URLs through the current URLFilters, to skip prohibited URLs and
links.
- LinkDbMerger() - Constructor for class org.apache.nutch.crawl.LinkDbMerger
-
- LinkDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDbMerger
-
- LinkDbReader - Class in org.apache.nutch.crawl
-
- LinkDbReader() - Constructor for class org.apache.nutch.crawl.LinkDbReader
-
- LinkDbReader(Configuration, Path) - Constructor for class org.apache.nutch.crawl.LinkDbReader
-
- LinkDumper - Class in org.apache.nutch.scoring.webgraph
-
The LinkDumper tool creates a database of node to inlink information that can
be read using the nested Reader class.
- LinkDumper() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper
-
- LinkDumper.Inverter - Class in org.apache.nutch.scoring.webgraph
-
Inverts outlinks from the WebGraph to inlinks and attaches node
information.
- LinkDumper.Inverter() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
- LinkDumper.LinkNode - Class in org.apache.nutch.scoring.webgraph
-
Bean class which holds url to node information.
- LinkDumper.LinkNode() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- LinkDumper.LinkNode(String, Node) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- LinkDumper.LinkNodes - Class in org.apache.nutch.scoring.webgraph
-
Writable class which holds an array of LinkNode objects.
- LinkDumper.LinkNodes() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- LinkDumper.LinkNodes(LinkDumper.LinkNode[]) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- LinkDumper.Merger - Class in org.apache.nutch.scoring.webgraph
-
Merges LinkNode objects into a single array value per url.
- LinkDumper.Merger() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
- LinkDumper.Reader - Class in org.apache.nutch.scoring.webgraph
-
Reader class which will print out the url and all of its inlinks to system
out.
- LinkDumper.Reader() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
-
- LinkRank - Class in org.apache.nutch.scoring.webgraph
-
- LinkRank() - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
-
Default constructor.
- LinkRank(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
-
Configurable constructor.
- list(List<Path>, Writer) - Method in class org.apache.nutch.segment.SegmentReader
-
- LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- LOCK_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- LOCK_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
-
- LOCK_NAME - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
-
- LockUtil - Class in org.apache.nutch.util
-
Utility methods for handling application-level locking.
- LockUtil() - Constructor for class org.apache.nutch.util.LockUtil
-
- LOG - Static variable in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- LOG - Static variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- LOG - Static variable in class org.apache.nutch.crawl.Crawl
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDbReader
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDbReducer
-
- LOG - Static variable in class org.apache.nutch.crawl.FetchScheduleFactory
-
- LOG - Static variable in class org.apache.nutch.crawl.Generator
-
- LOG - Static variable in class org.apache.nutch.crawl.Injector
-
- LOG - Static variable in class org.apache.nutch.crawl.LinkDb
-
- LOG - Static variable in class org.apache.nutch.crawl.LinkDbFilter
-
- LOG - Static variable in class org.apache.nutch.crawl.LinkDbReader
-
- LOG - Static variable in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- LOG - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- LOG - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- LOG - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- LOG - Static variable in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexer.CleaningJob
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexingFilters
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexingJob
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexWriters
-
- LOG - Static variable in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- LOG - Static variable in class org.apache.nutch.indexer.solr.SolrUtils
-
- LOG - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
Logger
- LOG - Static variable in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- LOG - Static variable in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- LOG - Static variable in class org.apache.nutch.indexwriter.solr.SolrUtils
-
- LOG - Static variable in class org.apache.nutch.microformats.reltag.RelTagParser
-
- LOG - Static variable in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
-
- LOG - Static variable in class org.apache.nutch.net.URLNormalizers
-
- LOG - Static variable in class org.apache.nutch.parse.ext.ExtParser
-
- LOG - Static variable in class org.apache.nutch.parse.feed.FeedParser
-
- LOG - Static variable in class org.apache.nutch.parse.html.HtmlParser
-
- LOG - Static variable in class org.apache.nutch.parse.js.JSParseFilter
-
- LOG - Static variable in class org.apache.nutch.parse.ParserChecker
-
- LOG - Static variable in class org.apache.nutch.parse.ParseResult
-
- LOG - Static variable in class org.apache.nutch.parse.ParserFactory
-
- LOG - Static variable in class org.apache.nutch.parse.ParseSegment
-
- LOG - Static variable in class org.apache.nutch.parse.ParseUtil
-
- LOG - Static variable in class org.apache.nutch.parse.swf.SWFParser
-
- LOG - Static variable in class org.apache.nutch.parse.tika.TikaParser
-
- LOG - Static variable in class org.apache.nutch.parse.zip.ZipTextExtractor
-
- LOG - Static variable in class org.apache.nutch.plugin.PluginDescriptor
-
- LOG - Static variable in class org.apache.nutch.plugin.PluginManifestParser
-
- LOG - Static variable in class org.apache.nutch.plugin.PluginRepository
-
- LOG - Static variable in class org.apache.nutch.protocol.file.File
-
- LOG - Static variable in class org.apache.nutch.protocol.ftp.Ftp
-
- LOG - Static variable in class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
- LOG - Static variable in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
- LOG - Static variable in class org.apache.nutch.protocol.http.Http
-
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.Http
-
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
- LOG - Static variable in class org.apache.nutch.protocol.ProtocolFactory
-
- LOG - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.LinkDumper
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.LinkRank
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.NodeDumper
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
-
- LOG - Static variable in class org.apache.nutch.segment.SegmentReader
-
- LOG - Static variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- LOG - Static variable in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- LOG - Static variable in class org.apache.nutch.tools.CrawlDBScanner
-
- LOG - Static variable in class org.apache.nutch.tools.DmozParser
-
- LOG - Static variable in class org.apache.nutch.tools.ResolveUrls
-
- LOG - Static variable in class org.apache.nutch.util.EncodingDetector
-
- LOG - Static variable in class org.creativecommons.nutch.CCIndexingFilter
-
- LOG - Static variable in class org.creativecommons.nutch.CCParseFilter
-
- logConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- LogDebugHandler - Class in org.apache.nutch.tools.proxy
-
- LogDebugHandler() - Constructor for class org.apache.nutch.tools.proxy.LogDebugHandler
-
- login(String, String) - Method in class org.apache.nutch.protocol.ftp.Client
-
Login to the FTP server using the provided username and password.
- logout() - Method in class org.apache.nutch.protocol.ftp.Client
-
Logout of the FTP server by sending the QUIT command.
- longestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns the longest prefix of input that is matched,
or null if no match exists.
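The contract above (longest matching prefix, or null) can be illustrated without a trie. The sketch below deliberately swaps in a hash-set lookup over successively shorter prefixes of the input; the class name `LongestPrefixSketch` and its constructor are hypothetical and do not mirror the PrefixStringMatcher API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class LongestPrefixSketch {
    private final Set<String> prefixes;

    LongestPrefixSketch(String... prefixes) {
        this.prefixes = new HashSet<>(Arrays.asList(prefixes));
    }

    /** Returns the longest stored prefix that input starts with, or null if none matches. */
    String longestMatch(String input) {
        // Try the longest candidate first so the first hit is the longest match.
        for (int end = input.length(); end > 0; end--) {
            String candidate = input.substring(0, end);
            if (prefixes.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LongestPrefixSketch m = new LongestPrefixSketch("http", "http://www", "ftp");
        System.out.println(m.longestMatch("http://www.example.org/")); // prints "http://www"
        System.out.println(m.longestMatch("mailto:someone"));          // prints "null"
    }
}
```

A trie avoids creating a substring per candidate length, which matters when many URLs are matched; the set-based version is shown only because it is shorter.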
- longestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns the longest suffix of input that is matched,
or null if no match exists.
- longestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the longest substring of input that is
matched by a pattern in the trie, or null if no match
exists.
- LoopReader - Class in org.apache.nutch.scoring.webgraph
-
The LoopReader tool prints the loopset information for a single url.
- LoopReader() - Constructor for class org.apache.nutch.scoring.webgraph.LoopReader
-
- LoopReader(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.LoopReader
-
- Loops - Class in org.apache.nutch.scoring.webgraph
-
The Loops job identifies link cycles (loops) inside the web graph.
- Loops() - Constructor for class org.apache.nutch.scoring.webgraph.Loops
-
- Loops.Finalizer - Class in org.apache.nutch.scoring.webgraph
-
Finishes the Loops job by aggregating and collecting found routes.
- Loops.Finalizer() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Default constructor.
- Loops.Finalizer(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Configurable constructor.
- Loops.Initializer - Class in org.apache.nutch.scoring.webgraph
-
Initializes the Loop routes.
- Loops.Initializer() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Default constructor.
- Loops.Initializer(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Configurable constructor.
- Loops.Looper - Class in org.apache.nutch.scoring.webgraph
-
Follows a route path looking for the start url of the route.
- Loops.Looper() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Default constructor.
- Loops.Looper(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Configurable constructor.
- Loops.LoopSet - Class in org.apache.nutch.scoring.webgraph
-
A set of loops.
- Loops.LoopSet() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- Loops.Route - Class in org.apache.nutch.scoring.webgraph
-
A link path or route looking to identify a link cycle.
- Loops.Route() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Route
-
- LOOPS_DIR - Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
- m_currentNode - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Current node
- m_doc - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Root document
- m_docFrag - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
First node of document fragment or null if not a DocumentFragment
- m_elemStack - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Vector of element nodes
- m_inCData - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Flag indicating that we are processing a CData section
- main(String[]) - Static method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- main(String[]) - Static method in class org.apache.nutch.crawl.Crawl
-
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDb
-
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
-
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbReader
-
- main(String[]) - Static method in class org.apache.nutch.crawl.Generator
-
Generate a fetchlist from the crawldb.
- main(String[]) - Static method in class org.apache.nutch.crawl.Injector
-
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDb
-
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbMerger
-
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbReader
-
- main(String[]) - Static method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- main(String[]) - Static method in class org.apache.nutch.crawl.TextProfileSignature
-
- main(String[]) - Static method in class org.apache.nutch.fetcher.Fetcher
-
Run the fetcher.
- main(String[]) - Static method in class org.apache.nutch.fetcher.OldFetcher
-
Run the fetcher.
- main(String[]) - Static method in class org.apache.nutch.indexer.CleaningJob
-
- main(String[]) - Static method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- main(String[]) - Static method in class org.apache.nutch.indexer.IndexingJob
-
- main(String[]) - Static method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- main(String[]) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
-
- main(String[]) - Static method in class org.apache.nutch.net.URLFilterChecker
-
- main(String[]) - Static method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
Spits out patterns and substitutions that are in the configuration file.
- main(String[]) - Static method in class org.apache.nutch.net.URLNormalizerChecker
-
- main(String[]) - Static method in class org.apache.nutch.parse.feed.FeedParser
-
Runs a command line version of this Parser.
- main(String[]) - Static method in class org.apache.nutch.parse.html.HtmlParser
-
- main(String[]) - Static method in class org.apache.nutch.parse.js.JSParseFilter
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParseData
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParserChecker
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParseSegment
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParseText
-
- main(String[]) - Static method in class org.apache.nutch.parse.swf.SWFParser
-
Arguments are: 0.
- main(String[]) - Static method in class org.apache.nutch.plugin.PluginRepository
-
Loads all necessary dependencies for a selected plugin, and then runs one
of the classes' main() method.
- main(String[]) - Static method in class org.apache.nutch.protocol.Content
-
- main(String[]) - Static method in class org.apache.nutch.protocol.file.File
-
Quick way for running this class.
- main(String[]) - Static method in class org.apache.nutch.protocol.ftp.Ftp
-
For debugging.
- main(HttpBase, String[]) - Static method in class org.apache.nutch.protocol.http.api.HttpBase
-
- main(String[]) - Static method in class org.apache.nutch.protocol.http.Http
-
- main(String[]) - Static method in class org.apache.nutch.protocol.httpclient.Http
-
Main method.
- main(String[]) - Static method in class org.apache.nutch.protocol.RobotRulesParser
-
command-line main for testing
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkRank
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LoopReader
-
Runs the LoopReader tool.
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.Loops
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeReader
-
Runs the NodeReader tool.
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.WebGraph
-
- main(String[]) - Static method in class org.apache.nutch.segment.SegmentMerger
-
- main(String[]) - Static method in class org.apache.nutch.segment.SegmentReader
-
- main(String[]) - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- main(String[]) - Static method in class org.apache.nutch.tools.Benchmark
-
- main(String[]) - Static method in class org.apache.nutch.tools.CrawlDBScanner
-
- main(String[]) - Static method in class org.apache.nutch.tools.DmozParser
-
Command-line access.
- main(String[]) - Static method in class org.apache.nutch.tools.FreeGenerator
-
- main(String[]) - Static method in class org.apache.nutch.tools.proxy.TestbedProxy
-
- main(String[]) - Static method in class org.apache.nutch.tools.ResolveUrls
-
Runs the resolve urls tool.
- main(RegexURLFilterBase, String[]) - Static method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Filter the standard input using a RegexURLFilterBase.
- main(String[]) - Static method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.util.CommandRunner
-
- main(String[]) - Static method in class org.apache.nutch.util.domain.DomainStatistics
-
- main(String[]) - Static method in class org.apache.nutch.util.EncodingDetector
-
- main(String[]) - Static method in class org.apache.nutch.util.PrefixStringMatcher
-
- main(String[]) - Static method in class org.apache.nutch.util.StringUtil
-
- main(String[]) - Static method in class org.apache.nutch.util.SuffixStringMatcher
-
- main(String[]) - Static method in class org.apache.nutch.util.URLUtil
-
For testing
- majorCodes - Static variable in class org.apache.nutch.parse.ParseStatus
-
- makeIOException(SolrServerException) - Static method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbFilter
-
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- map(Text, CrawlDatum, OutputCollector<Text, LongWritable>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- map(Text, CrawlDatum, OutputCollector<FloatWritable, Text>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- map(Text, CrawlDatum, OutputCollector<FloatWritable, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Select & invert subset due for fetch.
- map(FloatWritable, Generator.SelectorEntry, OutputCollector<Text, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.crawl.Generator.SelectorInverseMapper
-
- map(WritableComparable<?>, Text, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
-
- map(Text, ParseData, OutputCollector<Text, Inlinks>, Reporter) - Method in class org.apache.nutch.crawl.LinkDb
-
- map(Text, Inlinks, OutputCollector<Text, Inlinks>, Reporter) - Method in class org.apache.nutch.crawl.LinkDbFilter
-
- map(Text, CrawlDatum, OutputCollector<ByteWritable, Text>, Reporter) - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- map(Text, Writable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- map(WritableComparable<?>, Content, OutputCollector<Text, ParseImpl>, Reporter) - Method in class org.apache.nutch.parse.ParseSegment
-
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
Wraps all values in ObjectWritables.
- map(Text, Loops.Route, OutputCollector<Text, Loops.Route>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Maps out any found routes; those will be the link cycles.
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Wraps values in ObjectWritable.
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Wrap values in ObjectWritable.
- map(Text, Node, OutputCollector<Text, FloatWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
Outputs the host or domain as key for this record and numInlinks, numOutlinks
or score as the value.
- map(Text, Node, OutputCollector<FloatWritable, Text>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
Outputs the url with the appropriate number of inlinks or outlinks, or the
score.
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Changes input into ObjectWritables.
- map(Text, Writable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
Passes through existing LinkDatum objects from an existing OutlinkDb and
maps out new LinkDatum objects from new crawls' ParseData.
- map(Text, MetaWrapper, OutputCollector<Text, MetaWrapper>, Reporter) - Method in class org.apache.nutch.segment.SegmentMerger
-
- map(WritableComparable<?>, Writable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatMapper
-
- map(Text, BytesWritable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Runs the Map job to translate an arc record into output for Nutch
segments.
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.tools.CrawlDBScanner
-
- map(WritableComparable<?>, Text, OutputCollector<Text, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.tools.FreeGenerator.FG
-
- mapCopyKey(String) - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- mapKey(String) - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- MAPPING_FILE - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- MAPPING_FILE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- MapWritable - Class in org.apache.nutch.crawl
-
Deprecated.
Use org.apache.hadoop.io.MapWritable instead.
- MapWritable() - Constructor for class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- MapWritable(MapWritable) - Constructor for class org.apache.nutch.crawl.MapWritable
-
Deprecated.
Copy constructor.
- match(String) - Method in class org.apache.nutch.urlfilter.api.RegexRule
-
Checks if a url matches this rule.
- matchChar(TrieStringMatcher.TrieNode, String, int) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the next TrieStringMatcher.TrieNode visited, given that you are at
node, and the next character in the input is the idx'th character of s.
- matches(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns true if the given String is matched by a prefix in the trie.
- matches(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns true if the given String is matched by a suffix in the trie.
- matches(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns true if the given String is matched by a pattern in the trie.
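The trie-based matching described in the entries above can be illustrated with a minimal, self-contained sketch. This is not the Nutch implementation: the class name `SimplePrefixTrie`, its nested `Node` type, and its method names are hypothetical and chosen only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of prefix matching with a trie; not Nutch's TrieStringMatcher.
class SimplePrefixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored prefix ends at this node
    }

    private final Node root = new Node();

    SimplePrefixTrie(String... prefixes) {
        for (String prefix : prefixes) {
            add(prefix);
        }
    }

    // Inserts one prefix, creating child nodes as needed.
    private void add(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // Returns true if any stored prefix is a prefix of the input.
    boolean matches(String input) {
        Node node = root;
        for (int i = 0; i < input.length(); i++) {
            if (node.terminal) {
                return true; // a complete stored prefix has been consumed
            }
            node = node.children.get(input.charAt(i));
            if (node == null) {
                return false; // no stored prefix continues with this character
            }
        }
        return node.terminal;
    }
}
```

With the prefixes `"http://"` and `"ftp://"`, `matches("http://example.org/")` is true, while `matches("htt")` is false because no stored prefix has been fully consumed. A suffix matcher can reuse the same structure by inserting and looking up reversed strings.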
- maxContent - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The length limit for downloaded content, in bytes.
- maxCrawlDelay - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Skip page if Crawl-Delay is longer than this value.
- maxInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
-
- MD5Signature - Class in org.apache.nutch.crawl
-
Default implementation of a page signature.
- MD5Signature() - Constructor for class org.apache.nutch.crawl.MD5Signature
-
- merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDbMerger
-
- merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- merge(Path, Path[], boolean, boolean, long) - Method in class org.apache.nutch.segment.SegmentMerger
-
- Metadata - Class in org.apache.nutch.metadata
-
A multi-valued metadata container.
- Metadata() - Constructor for class org.apache.nutch.metadata.Metadata
-
Constructs a new, empty metadata.
- MetadataIndexer - Class in org.apache.nutch.indexer.metadata
-
Indexer which can be configured to extract metadata from the crawldb, parse metadata or content metadata.
- MetadataIndexer() - Constructor for class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- MetaTagsParser - Class in org.apache.nutch.parse
-
Parse HTML meta tags (keywords, description) and store them in the parse metadata so that
they can be indexed with the index-metadata plugin with the prefix 'metatag.'
- MetaTagsParser() - Constructor for class org.apache.nutch.parse.MetaTagsParser
-
- MetaWrapper - Class in org.apache.nutch.metadata
-
This is a simple decorator that adds metadata to any Writable-s that can be
serialized by NutchWritable.
- MetaWrapper() - Constructor for class org.apache.nutch.metadata.MetaWrapper
-
- MetaWrapper(Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
-
- MetaWrapper(Metadata, Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
-
- MimeAdaptiveFetchSchedule - Class in org.apache.nutch.crawl
-
Extension of AdaptiveFetchSchedule that allows for more flexible configuration
of DEC and INC factors for various MIME-types.
- MimeAdaptiveFetchSchedule() - Constructor for class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- MimeUtil - Class in org.apache.nutch.util
-
- MimeUtil(Configuration) - Constructor for class org.apache.nutch.util.MimeUtil
-
- MIN_CONFIDENCE_KEY - Static variable in class org.apache.nutch.util.EncodingDetector
-
- MissingDependencyException - Exception in org.apache.nutch.plugin
-
MissingDependencyException will be thrown if a plugin dependency cannot be
found.
- MissingDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
-
- MissingDependencyException(String) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
-
- MODIFIED - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Date on which the resource was changed.
- MoreIndexingFilter - Class in org.apache.nutch.indexer.more
-
Add (or reset) a few metaData properties as respective fields (if they are
available), so that they can be accurately used within the search index.
- MoreIndexingFilter() - Constructor for class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- MOVED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource has moved permanently.
- PARAMS - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- PARAMS - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
Deprecated.
- parse(InputStream) - Method in class org.apache.nutch.collection.CollectionManager
-
- Parse - Interface in org.apache.nutch.parse
-
The result of parsing a page's raw content.
- parse(Path) - Method in class org.apache.nutch.parse.ParseSegment
-
- parse(Content) - Method in class org.apache.nutch.parse.ParseUtil
-
Performs a parse by iterating through a List of preferred Parsers until a
successful parse is performed and a Parse object is returned.
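The "try each preferred parser until one succeeds" pattern described above can be sketched generically. This is an illustration only, under stated assumptions: the class `FirstSuccessfulParse` is hypothetical, and candidates are modelled as plain functions returning `Optional` rather than as Nutch `Parser` plugins.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of trying candidates in preference order; not Nutch's ParseUtil.
class FirstSuccessfulParse {
    // Each candidate returns Optional.empty() when it cannot handle the input.
    static <I, O> Optional<O> parse(List<Function<I, Optional<O>>> candidates, I input) {
        for (Function<I, Optional<O>> candidate : candidates) {
            Optional<O> result = candidate.apply(input);
            if (result.isPresent()) {
                return result; // stop at the first successful candidate
            }
        }
        return Optional.empty(); // no candidate succeeded
    }
}
```

Later candidates are never invoked once an earlier one succeeds, so the order of the list expresses the preference.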
- parse(String) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a String in format "segmentName/partName".
- PARSE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- parseByExtensionId(String, Content) - Method in class org.apache.nutch.parse.ParseUtil
-
Method parses a Content object using the Parser specified by the parameter
extId, i.e., the Parser's extension ID.
- parseCharacterEncoding(String) - Static method in class org.apache.nutch.util.EncodingDetector
-
Parse the character encoding from the specified content type header.
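As a rough illustration of extracting a character encoding from a content type header, the sketch below pulls the `charset` parameter out of a header value with a regular expression. It is not the EncodingDetector code; the class `CharsetFromContentType`, the method `parseCharset`, and the exact pattern are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not Nutch's EncodingDetector.parseCharacterEncoding.
class CharsetFromContentType {
    // Matches charset=VALUE, optionally quoted, ignoring case.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset parameter value, or null if the header has none.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```

For example, `parseCharset("text/html; charset=UTF-8")` yields `UTF-8`, and a header without a `charset` parameter yields `null`.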
- parsed - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- ParseData - Class in org.apache.nutch.parse
-
Data extracted from a page's content.
- ParseData() - Constructor for class org.apache.nutch.parse.ParseData
-
- ParseData(ParseStatus, String, Outlink[], Metadata) - Constructor for class org.apache.nutch.parse.ParseData
-
- ParseData(ParseStatus, String, Outlink[], Metadata, Metadata) - Constructor for class org.apache.nutch.parse.ParseData
-
- parseDmozFile(File, int, boolean, int, Pattern) - Method in class org.apache.nutch.tools.DmozParser
-
Iterate through all the items in this structured DMOZ file.
- parseErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- ParseException - Exception in org.apache.nutch.parse
-
- ParseException() - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseException(String) - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseException(String, Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseException(Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseImpl - Class in org.apache.nutch.parse
-
The result of parsing a page's raw content.
- ParseImpl() - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(Parse) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(String, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(ParseText, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(ParseText, ParseData, boolean) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- parseList(ArrayList, String) - Method in class org.apache.nutch.collection.Subcollection
-
Creates a list of patterns from a chunk of text; patterns are separated by
newlines.
- ParseOutputFormat - Class in org.apache.nutch.parse
-
- ParseOutputFormat() - Constructor for class org.apache.nutch.parse.ParseOutputFormat
-
- parsePluginFolder(String[]) - Method in class org.apache.nutch.plugin.PluginManifestParser
-
Returns a list of all found plugin descriptors.
- Parser - Interface in org.apache.nutch.parse
-
A parser for content generated by a Protocol implementation.
- ParserChecker - Class in org.apache.nutch.parse
-
Parser checker, useful for testing parser.
- ParserChecker() - Constructor for class org.apache.nutch.parse.ParserChecker
-
- ParseResult - Class in org.apache.nutch.parse
-
A utility class that stores result of a parse.
- ParseResult(String) - Constructor for class org.apache.nutch.parse.ParseResult
-
Create a container for parse results.
- ParserFactory - Class in org.apache.nutch.parse
-
Creates and caches Parser plugins.
- ParserFactory(Configuration) - Constructor for class org.apache.nutch.parse.ParserFactory
-
- ParserNotFound - Exception in org.apache.nutch.parse
-
- ParserNotFound(String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
-
- ParserNotFound(String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
-
- ParserNotFound(String, String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
-
- parseRules(String, byte[], String, String) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Parses the robots content using the SimpleRobotRulesParser
from crawler commons
- ParseSegment - Class in org.apache.nutch.parse
-
- ParseSegment() - Constructor for class org.apache.nutch.parse.ParseSegment
-
- ParseSegment(Configuration) - Constructor for class org.apache.nutch.parse.ParseSegment
-
- ParseStatus - Class in org.apache.nutch.parse
-
- ParseStatus() - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, int) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
-
Simplified constructor for passing just a text message.
- ParseStatus(int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
-
Simplified constructor for passing just a text message.
- ParseStatus(Throwable) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseText - Class in org.apache.nutch.parse
-
- ParseText() - Constructor for class org.apache.nutch.parse.ParseText
-
- ParseText(String) - Constructor for class org.apache.nutch.parse.ParseText
-
- ParseUtil - Class in org.apache.nutch.parse
-
A Utility class containing methods to simply perform parsing utilities such
as iterating through a preferred list of Parsers to obtain Parse objects.
- ParseUtil(Configuration) - Constructor for class org.apache.nutch.parse.ParseUtil
-
- PARTITION_MODE_DOMAIN - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- PARTITION_MODE_HOST - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- PARTITION_MODE_IP - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- PARTITION_MODE_KEY - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- partName - Variable in class org.apache.nutch.segment.SegmentPart
-
Name of the segment part (ie.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
- passScoreAfterParsing(Text, Content, Parse) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Currently a part of score distribution is performed using only data coming
from the parsing process.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metadata, which was lumped inside the content, and replicates it
within your parse data.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method takes all relevant score information from the current datum
(coming from a generated fetchlist) and stores it into Content metadata.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metadata, specified in your "urlmeta.tags" property, from the
datum object and injects it into the content.
- PassURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.pass
-
This URLNormalizer doesn't change urls.
- PassURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
-
- PASSWORD - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- PASSWORD - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- PERM_REFRESH_TIME - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- PERM_REFRESH_TIME - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- Pluggable - Interface in org.apache.nutch.plugin
-
Defines the capability of a class to be plugged into Nutch.
- Plugin - Class in org.apache.nutch.plugin
-
A nutch-plugin is a container for a set of custom logic that provides
extensions to the nutch core functionality or to another plugin that provides
an API for extending.
- Plugin(PluginDescriptor, Configuration) - Constructor for class org.apache.nutch.plugin.Plugin
-
Constructor
- PluginClassLoader - Class in org.apache.nutch.plugin
-
The PluginClassLoader contains only classes of the runtime libraries set up
in the plugin manifest file and the exported libraries of required plugins.
- PluginClassLoader(URL[], ClassLoader) - Constructor for class org.apache.nutch.plugin.PluginClassLoader
-
Constructor
- PluginDescriptor - Class in org.apache.nutch.plugin
-
The PluginDescriptor provides access to all meta information of a
nutch-plugin, as well as to the internationalizable resources and the plugin's
own classloader.
- PluginDescriptor(String, String, String, String, String, String, Configuration) - Constructor for class org.apache.nutch.plugin.PluginDescriptor
-
Constructor
- PluginManifestParser - Class in org.apache.nutch.plugin
-
The PluginManifestParser just parses the manifest file in all plugin
directories.
- PluginManifestParser(Configuration, PluginRepository) - Constructor for class org.apache.nutch.plugin.PluginManifestParser
-
- PluginRepository - Class in org.apache.nutch.plugin
-
The plugin repository is a registry of all plugins.
- PluginRepository(Configuration) - Constructor for class org.apache.nutch.plugin.PluginRepository
-
- PluginRuntimeException - Exception in org.apache.nutch.plugin
-
PluginRuntimeException will be thrown when an exception occurs in the plugin
management.
- PluginRuntimeException(Throwable) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
-
- PluginRuntimeException(String) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
-
- pos - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- PrefixStringMatcher - Class in org.apache.nutch.util
-
A class for efficiently matching Strings against a set of prefixes.
- PrefixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
-
Creates a new PrefixStringMatcher which will match Strings with any prefix
in the supplied array.
- PrefixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
-
Creates a new PrefixStringMatcher which will match Strings with any prefix
in the supplied Collection.
- PrefixURLFilter - Class in org.apache.nutch.urlfilter.prefix
-
Filters URLs based on a file of URL prefixes.
- PrefixURLFilter() - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- PrefixURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- PrintCommandListener - Class in org.apache.nutch.protocol.ftp
-
This is a support class for logging all ftp command/reply traffic.
- PrintCommandListener(Logger) - Constructor for class org.apache.nutch.protocol.ftp.PrintCommandListener
-
- processDeflateEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- processDumpJob(String, String, Configuration, String, String, String) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- processDumpJob(String, String) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- processGzipEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- processingInstruction(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of a processing instruction.
- processStatJob(String, Configuration, boolean) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- processTopNJob(String, long, float, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- PROTO_NOT_FOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
This protocol was not found.
- PROTO_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- Protocol - Interface in org.apache.nutch.protocol
-
A retriever of url content.
- PROTOCOL_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- PROTOCOL_REDIR - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- protocolCommandSent(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
-
- ProtocolException - Exception in org.apache.nutch.net.protocols
-
- ProtocolException() - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(String) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException - Exception in org.apache.nutch.protocol
-
- ProtocolException() - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolException(String) - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolException(Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolFactory - Class in org.apache.nutch.protocol
-
- ProtocolFactory(Configuration) - Constructor for class org.apache.nutch.protocol.ProtocolFactory
-
- ProtocolNotFound - Exception in org.apache.nutch.protocol
-
- ProtocolNotFound(String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
-
- ProtocolNotFound(String, String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
-
- ProtocolOutput - Class in org.apache.nutch.protocol
-
Simple aggregate to pass from protocol plugins both content and
protocol status.
- ProtocolOutput(Content, ProtocolStatus) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
-
- ProtocolOutput(Content) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
-
- protocolReplyReceived(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
-
- ProtocolStatus - Class in org.apache.nutch.protocol
-
- ProtocolStatus() - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, String[]) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, String[], long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, Object) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, Object, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(Throwable) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- proxyHost - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy hostname.
- proxyPort - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy port.
- PUBLISHER - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity responsible for making the resource available.
- put(Writable, Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- put(Text, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
-
Store a result of parsing.
- put(String, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
-
Store a result of parsing.
- putAll(MapWritable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- putAllMetaData(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Add all metadata from other CrawlDatum to this CrawlDatum.
- read(DataInput) - Static method in class org.apache.nutch.crawl.CrawlDatum
-
- read(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
-
- read(DataInput) - Static method in class org.apache.nutch.parse.Outlink
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseData
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseImpl
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseStatus
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseText
-
- read(DataInput) - Static method in class org.apache.nutch.protocol.Content
-
- read(DataInput) - Static method in class org.apache.nutch.protocol.ProtocolStatus
-
- readConfiguration(Reader) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlink
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlinks
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchDocument
-
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchField
-
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchIndexAction
-
- readFields(DataInput) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- readFields(DataInput) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- readFields(DataInput) - Method in class org.apache.nutch.metadata.Metadata
-
- readFields(DataInput) - Method in class org.apache.nutch.metadata.MetaWrapper
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.Outlink
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseData
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseImpl
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseStatus
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseText
-
- readFields(DataInput) - Method in class org.apache.nutch.protocol.Content
-
- readFields(DataInput) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- readFields(DataInput) - Method in class org.apache.nutch.util.GenericWritableConfigurable
-
- readSolrDocument(SolrDocument) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- readUrl(String, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Too many redirects.
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- reduce(Text, Iterator<LongWritable>, OutputCollector<Text, LongWritable>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- reduce(Text, Iterator<LongWritable>, OutputCollector<Text, LongWritable>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- reduce(FloatWritable, Iterator<Text>, OutputCollector<FloatWritable, Text>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReducer
-
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- reduce(Text, Iterator<Generator.SelectorEntry>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Generator.PartitionReducer
-
- reduce(FloatWritable, Iterator<Generator.SelectorEntry>, OutputCollector<FloatWritable, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Collect until limit is reached.
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
- reduce(Text, Iterator<Inlinks>, OutputCollector<Text, Inlinks>, Reporter) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- reduce(ByteWritable, Iterator<Text>, OutputCollector<Text, ByteWritable>, Reporter) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- reduce(Text, Iterator<NutchWritable>, OutputCollector<Text, NutchIndexAction>, Reporter) - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- reduce(Text, Iterator<SolrDeleteDuplicates.SolrRecord>, OutputCollector<Text, SolrDeleteDuplicates.SolrRecord>, Reporter) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- reduce(Text, Iterator<Writable>, OutputCollector<Text, Writable>, Reporter) - Method in class org.apache.nutch.parse.ParseSegment
-
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, LinkDumper.LinkNode>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
Inverts outlinks to inlinks while attaching node information to the
outlink.
- reduce(Text, Iterator<LinkDumper.LinkNode>, OutputCollector<Text, LinkDumper.LinkNodes>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
Aggregate all LinkNode objects for a given url.
- reduce(Text, Iterator<Loops.Route>, OutputCollector<Text, Loops.LoopSet>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Aggregates all found routes for a given start url into a loopset and
collects the loopset.
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, Loops.Route>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Takes any node that has inlinks and sets up a route for all of its
outlinks.
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, Loops.Route>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Performs a single loop pass looking for loop cycles within routes.
- reduce(Text, Iterator<FloatWritable>, OutputCollector<Text, FloatWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
Outputs either the sum or the top value for this record.
- reduce(FloatWritable, Iterator<Text>, OutputCollector<Text, FloatWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
Flips and collects the url and numeric sort value.
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Creates new CrawlDatum objects with the updated score from the NodeDb or
with a cleared score.
- reduce(Text, Iterator<NutchWritable>, OutputCollector<Text, LinkDatum>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
- reduce(Text, Iterator<MetaWrapper>, OutputCollector<Text, MetaWrapper>, Reporter) - Method in class org.apache.nutch.segment.SegmentMerger
-
NOTE: in selecting the latest version we rely exclusively on the segment
name (not all segment data contain time information).
- reduce(Text, Iterator<NutchWritable>, OutputCollector<Text, Text>, Reporter) - Method in class org.apache.nutch.segment.SegmentReader
-
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.tools.CrawlDBScanner
-
- reduce(Text, Iterator<Generator.SelectorEntry>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.tools.FreeGenerator.FG
-
- reduce(Text, Iterable<LongWritable>, Reducer<Text, LongWritable, Text, LongWritable>.Context) - Method in class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
-
- regexNormalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
This function does the replacements by iterating through all the regex
patterns.
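The description above (apply every regex pattern in turn) can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class `RegexNormalizeSketch`, its `Rule` type, and the two example rules are hypothetical and use only `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the RegexURLNormalizer source.
final class RegexNormalizeSketch {

    // One substitution rule: a compiled pattern and its replacement text.
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    // Rules are kept in insertion order and applied in that order.
    private final List<Rule> rules = new ArrayList<>();

    void addRule(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
    }

    // Iterates through all rules, feeding each result into the next rule.
    String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    // Builds a normalizer with two assumed example rules.
    static RegexNormalizeSketch example() {
        RegexNormalizeSketch n = new RegexNormalizeSketch();
        n.addRule(";jsessionid=[^?#]*", "");   // strip a session ID path parameter
        n.addRule("(?<!:)/{2,}", "/");          // collapse repeated slashes outside "://"
        return n;
    }
}
```

Because each rule sees the output of the previous one, rule order matters; a real rule file would need to be ordered with that in mind.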
- RegexRule - Class in org.apache.nutch.urlfilter.api
-
A generic regular expression rule.
- RegexRule(boolean, String) - Constructor for class org.apache.nutch.urlfilter.api.RegexRule
-
Constructs a new regular expression rule.
- RegexURLFilter - Class in org.apache.nutch.urlfilter.regex
-
- RegexURLFilter() - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- RegexURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- RegexURLFilterBase - Class in org.apache.nutch.urlfilter.api
-
- RegexURLFilterBase() - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new empty RegexURLFilterBase
- RegexURLFilterBase(File) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a file of rules.
- RegexURLFilterBase(String) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a list of rules.
- RegexURLFilterBase(Reader) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- RegexURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.regex
-
Allows users to do regex substitutions on all/any URLs that are encountered,
which is useful for stripping session IDs from URLs.
- RegexURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
The default constructor, which is called from UrlNormalizerFactory
(normalizerClass.newInstance()) in the getNormalizer() method.
- RegexURLNormalizer(Configuration) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
- RegexURLNormalizer(Configuration, String) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
Constructor which can be passed the file name, so it doesn't look in the
configuration files for it.
- REL_TAG - Static variable in class org.apache.nutch.microformats.reltag.RelTagParser
-
- RELATION - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A reference to a related resource.
- RelTagIndexingFilter - Class in org.apache.nutch.microformats.reltag
-
- RelTagIndexingFilter() - Constructor for class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- RelTagParser - Class in org.apache.nutch.microformats.reltag
-
Adds microformat rel-tags of document if found.
- RelTagParser() - Constructor for class org.apache.nutch.microformats.reltag.RelTagParser
-
- remove(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- remove(String) - Method in class org.apache.nutch.metadata.Metadata
-
Remove a metadata and all its associated values.
- remove(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- removeField(String) - Method in class org.apache.nutch.indexer.NutchDocument
-
- removeLockFile(FileSystem, Path) - Static method in class org.apache.nutch.util.LockUtil
-
Remove lock file.
- replace(FileSystem, Path, Path, boolean) - Static method in class org.apache.nutch.util.FSUtils
-
Replaces the current path with the new path and if set removes the old
path.
- REPR_URL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- reset() - Method in class org.apache.nutch.indexer.NutchField
-
- reset() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets all boolean values to false.
- resolveEncodingAlias(String) - Static method in class org.apache.nutch.util.EncodingDetector
-
- resolveURL(URL, String) - Static method in class org.apache.nutch.util.URLUtil
-
Resolves relative URLs and fixes a few java.net.URL errors
in the handling of URLs with embedded params and pure query
targets.
- ResolveUrls - Class in org.apache.nutch.tools
-
A simple tool that will spin up multiple threads to resolve urls to ip
addresses.
- ResolveUrls(String) - Constructor for class org.apache.nutch.tools.ResolveUrls
-
Create a new ResolveUrls with a file from the local file system.
- ResolveUrls(String, int) - Constructor for class org.apache.nutch.tools.ResolveUrls
-
Create a new ResolveUrls with a urls file and a number of threads for the
Thread pool.
- resolveUrls() - Method in class org.apache.nutch.tools.ResolveUrls
-
Creates a thread pool for resolving urls.
- Response - Interface in org.apache.nutch.net.protocols
-
A response interface.
- retrieveFile(String, OutputStream, int) - Method in class org.apache.nutch.protocol.ftp.Client
-
- retrieveList(String, List<FTPFile>, int, FTPFileEntryParser) - Method in class org.apache.nutch.protocol.ftp.Client
-
- RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Temporary failure.
- rightPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Returns a copy of s padded with trailing spaces so that its length is length.
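A minimal sketch of the padding behaviour described above, written without reference to the StringUtil source. The class name `RightPadSketch` is hypothetical, and returning the input unchanged when it is already at least `length` characters long is an assumption of this sketch.

```java
// Hypothetical illustration only; not the StringUtil source.
final class RightPadSketch {

    // Returns s followed by enough spaces to reach the requested length.
    // Assumption: a string that is already long enough is returned as-is.
    static String rightPad(String s, int length) {
        StringBuilder sb = new StringBuilder(s);
        for (int i = length - s.length(); i > 0; i--) {
            sb.append(' ');
        }
        return sb.toString();
    }
}
```

For example, `RightPadSketch.rightPad("ab", 4)` yields `"ab"` followed by two spaces.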
- RIGHTS - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Information about rights held in and over the resource.
- RobotRules - Interface in org.apache.nutch.protocol
-
This class holds the rules which were parsed from a robots.txt file, and can
test paths against those rules.
- RobotRulesParser - Class in org.apache.nutch.protocol
-
This class uses crawler-commons for handling the parsing of robots.txt
files.
- RobotRulesParser() - Constructor for class org.apache.nutch.protocol.RobotRulesParser
-
- RobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.RobotRulesParser
-
- ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Access denied by robots.txt rules.
- root - Variable in class org.apache.nutch.util.TrieStringMatcher
-
- ROUTES_DIR - Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
- run(String[]) - Method in class org.apache.nutch.crawl.Crawl
-
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDb
-
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDbMerger
-
- run(String[]) - Method in class org.apache.nutch.crawl.Generator
-
- run(String[]) - Method in class org.apache.nutch.crawl.Injector
-
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDb
-
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- run(RecordReader<Text, CrawlDatum>, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.fetcher.Fetcher
-
- run(String[]) - Method in class org.apache.nutch.fetcher.Fetcher
-
- run(RecordReader<WritableComparable<?>, Writable>, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- run(String[]) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- run(String[]) - Method in class org.apache.nutch.indexer.CleaningJob
-
- run(String[]) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- run(String[]) - Method in class org.apache.nutch.indexer.IndexingJob
-
- run(String[]) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- run(String[]) - Method in class org.apache.nutch.parse.ParserChecker
-
- run(String[]) - Method in class org.apache.nutch.parse.ParseSegment
-
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
Runs the LinkDumper tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkRank
-
Runs the LinkRank tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.Loops
-
Runs the Loops tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
Runs the node dumper tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Runs the ScoreUpdater tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
-
Parses command line arguments and runs the WebGraph jobs.
- run(String[]) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- run(String[]) - Method in class org.apache.nutch.tools.Benchmark
-
- run(String[]) - Method in class org.apache.nutch.tools.CrawlDBScanner
-
- run(String[]) - Method in class org.apache.nutch.tools.FreeGenerator
-
- run(String[]) - Method in class org.apache.nutch.util.domain.DomainStatistics
-
- save() - Method in class org.apache.nutch.collection.CollectionManager
-
Saves collections to a file.
- saveDom(OutputStream, Element) - Static method in class org.apache.nutch.util.DomUtil
-
Saves the DOM to an output stream.
- SCHEDULE_DEC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- SCHEDULE_INC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- SCHEDULE_MIME_FILE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- SCOPE_CRAWLDB - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when updating the CrawlDb with new URLs.
- SCOPE_DEFAULT - Static variable in class org.apache.nutch.net.URLNormalizers
-
Default scope.
- SCOPE_FETCHER - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used by Fetcher when processing redirect URLs.
- SCOPE_GENERATE_HOST_COUNT - Static variable in class org.apache.nutch.net.URLNormalizers
-
- SCOPE_INDEXER - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when indexing URLs.
- SCOPE_INJECT - Static variable in class org.apache.nutch.net.URLNormalizers
-
- SCOPE_LINKDB - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when updating the LinkDb with new URLs.
- SCOPE_OUTLINK - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when constructing new Outlink instances.
- SCOPE_PARTITION - Static variable in class org.apache.nutch.net.URLNormalizers
-
- SCORE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- ScoreUpdater - Class in org.apache.nutch.scoring.webgraph
-
Updates the score from the WebGraph node database into the crawl database.
- ScoreUpdater() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- ScoringFilter - Interface in org.apache.nutch.scoring
-
A contract defining behavior of scoring plugins.
- ScoringFilterException - Exception in org.apache.nutch.scoring
-
Specialized exception for errors during scoring.
- ScoringFilterException() - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilterException(String) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilterException(String, Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilterException(Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilters - Class in org.apache.nutch.scoring
-
- ScoringFilters(Configuration) - Constructor for class org.apache.nutch.scoring.ScoringFilters
-
- SECONDS_PER_DAY - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
- SEGMENT_NAME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- SegmentHandler - Class in org.apache.nutch.tools.proxy
-
XXX should turn this into a plugin?
- SegmentHandler(Configuration, Path) - Constructor for class org.apache.nutch.tools.proxy.SegmentHandler
-
- SegmentMergeFilter - Interface in org.apache.nutch.segment
-
Interface used to filter segments during segment merge.
- SegmentMergeFilters - Class in org.apache.nutch.segment
-
This class wraps all SegmentMergeFilter extensions in a single object
so it is easier to operate on them.
- SegmentMergeFilters(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMergeFilters
-
- SegmentMerger - Class in org.apache.nutch.segment
-
This tool takes several segments and merges their data together.
- SegmentMerger() - Constructor for class org.apache.nutch.segment.SegmentMerger
-
- SegmentMerger(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMerger
-
- SegmentMerger.ObjectInputFormat - Class in org.apache.nutch.segment
-
Wraps inputs in a MetaWrapper, to permit merging different
types in reduce and use additional metadata.
- SegmentMerger.ObjectInputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
-
- SegmentMerger.SegmentOutputFormat - Class in org.apache.nutch.segment
-
- SegmentMerger.SegmentOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
-
- segmentName - Variable in class org.apache.nutch.segment.SegmentPart
-
Name of the segment (just the last path component).
- SegmentPart - Class in org.apache.nutch.segment
-
Utility class for handling information about segment parts.
- SegmentPart() - Constructor for class org.apache.nutch.segment.SegmentPart
-
- SegmentPart(String, String) - Constructor for class org.apache.nutch.segment.SegmentPart
-
- SegmentReader - Class in org.apache.nutch.segment
-
Dump the content of a segment.
- SegmentReader() - Constructor for class org.apache.nutch.segment.SegmentReader
-
- SegmentReader(Configuration, boolean, boolean, boolean, boolean, boolean, boolean) - Constructor for class org.apache.nutch.segment.SegmentReader
-
- SegmentReader.InputCompatMapper - Class in org.apache.nutch.segment
-
- SegmentReader.InputCompatMapper() - Constructor for class org.apache.nutch.segment.SegmentReader.InputCompatMapper
-
- SegmentReader.SegmentReaderStats - Class in org.apache.nutch.segment
-
- SegmentReader.SegmentReaderStats() - Constructor for class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- SegmentReader.TextOutputFormat - Class in org.apache.nutch.segment
-
Implements a text output format
- SegmentReader.TextOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentReader.TextOutputFormat
-
- segnum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
-
- sendNoOp() - Method in class org.apache.nutch.protocol.ftp.Client
-
Sends a NOOP command to the FTP server.
- SERVER_URL - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- SERVER_URL - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- set(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Copy the contents of another instance into this instance.
- set(String, String) - Method in class org.apache.nutch.metadata.Metadata
-
Set metadata name/value.
- set(String, String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- setAll(Properties) - Method in class org.apache.nutch.metadata.Metadata
-
Copy all key-value pairs from properties.
- setAnchor(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setArgs(String[]) - Method in class org.apache.nutch.parse.ParseStatus
-
- setArgs(String[]) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setBaseHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the baseHref.
- setBlackList(String) - Method in class org.apache.nutch.collection.Subcollection
-
Set contents of blacklist from String
- setClazz(String) - Method in class org.apache.nutch.plugin.Extension
-
Sets the Class that implements the concrete extension and is only used until
model creation at system start up.
- setCode(int) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setCommand(String) - Method in class org.apache.nutch.util.CommandRunner
-
- setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.Signature
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.CleaningJob
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Handles conf assignment and pulls the value assignment from the
"urlmeta.tags" property.
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
-
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
-
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.ext.ExtParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.feed.FeedParser
-
Sets the Configuration object for this Parser.
- setConf(Configuration) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.html.HtmlParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.MetaTagsParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.ParserChecker
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.swf.SWFParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.tika.TikaParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.zip.ZipParser
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.file.File
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.Http
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.Http
-
Reads the configuration from the Nutch configuration files and sets
the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Handles conf assignment and pulls the value assignment from the
"urlmeta.tags" property.
- setConf(Configuration) - Method in class org.apache.nutch.segment.SegmentMerger
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
Sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
-
- setConf(Configuration) - Method in class org.apache.nutch.util.GenericWritableConfigurable
-
- setConf(Configuration) - Method in class org.creativecommons.nutch.CCIndexingFilter
-
- setConf(Configuration) - Method in class org.creativecommons.nutch.CCParseFilter
-
- setContent(byte[]) - Method in class org.apache.nutch.protocol.Content
-
- setContent(Content) - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- setContentType(String) - Method in class org.apache.nutch.protocol.Content
-
- setDataTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Sets the timeout in milliseconds to use for data connection.
- setDescriptor(PluginDescriptor) - Method in class org.apache.nutch.plugin.Extension
-
Sets the plugin descriptor and is only used until model creation at system
start up.
- setDocumentLocator(Locator) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive an object for locating the origin of SAX document events.
- setFetchInterval(int) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setFetchInterval(float) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Sets the fetchInterval and fetchTime on a successfully fetched page.
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.DefaultFetchSchedule
-
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Sets the fetchInterval and fetchTime on a successfully fetched page.
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- setFetchTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Sets either the time of the last fetch or the next fetch time,
depending on whether Fetcher or CrawlDbReducer set the time.
- setFileType(int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Sets the file type to be transferred.
- setFilterFromPath(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setFollowTalk(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set followTalk
- setFound(boolean) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- setId(String) - Method in class org.apache.nutch.plugin.Extension
-
Sets the unique extension Id and is only used until model creation at
system start up.
- setIDAttribute(String, Element) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Set an ID string to node association in the ID table.
- setIgnoreCase(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setInlinkScore(float) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setInputStream(InputStream) - Method in class org.apache.nutch.util.CommandRunner
-
- setKeepConnection(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set keepConnection
- setLastModified(long) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setLinks(LinkDumper.LinkNode[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- setLinkType(byte) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setLookingFor(String) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- setLoopSet(Set<String>) - Method in class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- setMajorCode(byte) - Method in class org.apache.nutch.parse.ParseStatus
-
- setMaxContentLength(int) - Method in class org.apache.nutch.protocol.file.File
-
Set the length at which content is truncated.
- setMaxContentLength(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the point at which content is truncated.
- setMessage(String) - Method in class org.apache.nutch.parse.ParseStatus
-
- setMessage(String) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setMeta(String, String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Set metadata.
- setMetaData(MapWritable) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setMetadata(Metadata) - Method in class org.apache.nutch.protocol.Content
-
Other protocol-specific data.
- setMetadata(Metadata) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setMinorCode(short) - Method in class org.apache.nutch.parse.ParseStatus
-
- setModeAccept(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setModifiedTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noCache to true.
- setNode(Node) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- setNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noFollow to true.
- setNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noIndex to true.
- setNumInlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setNumOutlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setObject(String, Object) - Method in class org.apache.nutch.util.ObjectCache
-
- setOutlinks(Outlink[]) - Method in class org.apache.nutch.parse.ParseData
-
- setOutlinkUrl(String) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method specifies how to schedule refetching of pages
marked as GONE.
- setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method specifies how to schedule refetching of pages
marked as GONE.
- setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method adjusts the fetch schedule if fetching needs to be
re-tried due to transient errors.
- setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method adjusts the fetch schedule if fetching needs to be
re-tried due to transient errors.
- setParseMeta(Metadata) - Method in class org.apache.nutch.parse.ParseData
-
- setRefresh(boolean) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets refresh to the supplied value.
- setRefreshHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the refreshHref.
- setRefreshTime(int) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the refreshTime.
- setRemoteVerificationEnabled(boolean) - Method in class org.apache.nutch.protocol.ftp.Client
-
Enable or disable verification that the remote host taking part
in a data connection is the same as the host to which the control
connection is attached.
- setRetriesSinceFetch(int) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setScore(float) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setScore(float) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setSignature(byte[]) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setStatus(int) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setStatus(ProtocolStatus) - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- setStdErrorStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
-
- setStdOutputStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
-
- setTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the timeout.
- setTimeout(int) - Method in class org.apache.nutch.util.CommandRunner
-
- setTimestamp(long) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setUrl(String) - Method in class org.apache.nutch.parse.Outlink
-
- setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- setWaitForExit(boolean) - Method in class org.apache.nutch.util.CommandRunner
-
- setWeight(float) - Method in class org.apache.nutch.indexer.NutchDocument
-
- setWeight(float) - Method in class org.apache.nutch.indexer.NutchField
-
- setWhiteList(ArrayList) - Method in class org.apache.nutch.collection.Subcollection
-
- setWhiteList(String) - Method in class org.apache.nutch.collection.Subcollection
-
Set contents of whitelist from String
- shortestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns the shortest prefix of input that is matched,
or null if no match exists.
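The contract stated above (shortest matching prefix, or null) can be sketched as follows. This is a plain linear scan over a list of prefixes rather than the trie that the Nutch matcher classes are built on, and the class name `ShortestPrefixSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; uses a linear scan, not a trie.
final class ShortestPrefixSketch {

    private final List<String> prefixes;

    ShortestPrefixSketch(String... prefixes) {
        this.prefixes = Arrays.asList(prefixes);
    }

    // Returns the shortest configured prefix that input starts with,
    // or null if none of the prefixes match.
    String shortestMatch(String input) {
        String best = null;
        for (String p : prefixes) {
            if (input.startsWith(p) && (best == null || p.length() < best.length())) {
                best = p;
            }
        }
        return best;
    }
}
```

With the prefixes `"http://"` and `"http://www."`, the input `"http://www.example.com"` matches both, and the shorter one, `"http://"`, is returned.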
- shortestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns the shortest suffix of input that is matched,
or null if no match exists.
- shortestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the shortest substring of input that is
matched by a pattern in the trie, or null if no match
exists.
- shouldFetch(Text, CrawlDatum, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method provides information whether the page is suitable for
selection in the current fetchlist.
- shouldFetch(Text, CrawlDatum, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method provides information whether the page is suitable for
selection in the current fetchlist.
- shutDown() - Method in class org.apache.nutch.plugin.Plugin
-
Shutdown the plugin.
- Signature - Class in org.apache.nutch.crawl
-
- Signature() - Constructor for class org.apache.nutch.crawl.Signature
-
- SIGNATURE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- SignatureComparator - Class in org.apache.nutch.crawl
-
- SignatureComparator() - Constructor for class org.apache.nutch.crawl.SignatureComparator
-
- SignatureFactory - Class in org.apache.nutch.crawl
-
Factory class, which instantiates a Signature implementation according to the
current Configuration.
- size() - Method in class org.apache.nutch.crawl.Inlinks
-
- size() - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- size() - Method in class org.apache.nutch.metadata.Metadata
-
Returns the number of metadata names in this metadata.
- size() - Method in class org.apache.nutch.parse.ParseResult
-
Return the number of parse outputs (both successful and failed)
- skip(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
-
Skips over one Inlink in the input.
- skip(DataInput) - Static method in class org.apache.nutch.parse.Outlink
-
Skips over one Outlink in the input.
- SKIP_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseSegment
-
- skipChildren() - Method in class org.apache.nutch.util.NodeWalker
-
Skips over and removes from the node stack the children of the last
node.
- skippedEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of a skipped entity.
- SOLR_PREFIX - Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- SOLR_PREFIX - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- SolrConstants - Interface in org.apache.nutch.indexer.solr
-
- SolrConstants - Interface in org.apache.nutch.indexwriter.solr
-
- SolrDeleteDuplicates - Class in org.apache.nutch.indexer.solr
-
Utility class for deleting duplicate documents from a Solr index.
- SolrDeleteDuplicates() - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- SolrDeleteDuplicates.SolrInputFormat - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrInputFormat() - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputFormat
-
- SolrDeleteDuplicates.SolrInputSplit - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrInputSplit() - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- SolrDeleteDuplicates.SolrInputSplit(int, int) - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- SolrDeleteDuplicates.SolrRecord - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrRecord() - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrDeleteDuplicates.SolrRecord(SolrDeleteDuplicates.SolrRecord) - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrDeleteDuplicates.SolrRecord(String, float, long) - Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrIndexWriter - Class in org.apache.nutch.indexwriter.solr
-
- SolrIndexWriter() - Constructor for class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- SolrMappingReader - Class in org.apache.nutch.indexwriter.solr
-
- SolrMappingReader(Configuration) - Constructor for class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- SolrUtils - Class in org.apache.nutch.indexer.solr
-
- SolrUtils() - Constructor for class org.apache.nutch.indexer.solr.SolrUtils
-
- SolrUtils - Class in org.apache.nutch.indexwriter.solr
-
- SolrUtils() - Constructor for class org.apache.nutch.indexwriter.solr.SolrUtils
-
- SOURCE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A reference to a resource from which the present resource is derived.
- SpellCheckedMetadata - Class in org.apache.nutch.metadata
-
A decorator to Metadata that adds spellchecking capabilities to property
names.
- SpellCheckedMetadata() - Constructor for class org.apache.nutch.metadata.SpellCheckedMetadata
-
- splitEnd - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitStart - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- start - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- startCDATA() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the start of a CDATA section.
- startDocument() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the beginning of a document.
- startDTD(String, String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the start of DTD declarations, if any.
- startElement(String, String, String, Attributes) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the beginning of an element.
- startEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the beginning of an entity.
- startPrefixMapping(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Begin the scope of a prefix-URI Namespace mapping.
- startUp() - Method in class org.apache.nutch.plugin.Plugin
-
Invoked when the plugin starts up.
- StaticFieldIndexer - Class in org.apache.nutch.indexer.staticfield
-
A simple plugin called at indexing that adds fields with static data.
- StaticFieldIndexer() - Constructor for class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
- statNames - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- STATUS_BLOCKED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_DB_FETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was successfully fetched.
- STATUS_DB_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page no longer exists.
- STATUS_DB_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Maximum value of DB-related status.
- STATUS_DB_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was successfully fetched and found not modified.
- STATUS_DB_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page permanently redirects to another page.
- STATUS_DB_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page temporarily redirects to another page.
- STATUS_DB_UNFETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page has not been fetched yet.
- STATUS_FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_FAILURE - Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_FETCH_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching unsuccessful - page is gone.
- STATUS_FETCH_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Maximum value of fetch-related status.
- STATUS_FETCH_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching successful - page is not modified.
- STATUS_FETCH_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetch was permanently redirected to another page.
- STATUS_FETCH_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetch was temporarily redirected to another page.
- STATUS_FETCH_RETRY - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching unsuccessful, needs to be retried (transient errors).
- STATUS_FETCH_SUCCESS - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching was successful.
- STATUS_GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_INJECTED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was newly injected.
- STATUS_LINKED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page discovered through a link.
- STATUS_MODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
Page is known to have been modified since our last visit.
- STATUS_NOTFETCHING - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTFOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTMODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
Page is known to remain unmodified since our last visit.
- STATUS_NOTMODIFIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTPARSED - Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_PARSE_META - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page received metadata from a parser.
- STATUS_REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_SIGNATURE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page signature.
- STATUS_SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_UNKNOWN - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
It is unknown whether the page has changed since our last visit.
- STATUS_WOULDBLOCK - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- StringUtil - Class in org.apache.nutch.util
-
A collection of String processing utility methods.
- StringUtil() - Constructor for class org.apache.nutch.util.StringUtil
-
- stripNonCharCodepoints(String) - Static method in class org.apache.nutch.indexer.solr.SolrUtils
-
- stripNonCharCodepoints(String) - Static method in class org.apache.nutch.indexwriter.solr.SolrUtils
-
- Subcollection - Class in org.apache.nutch.collection
-
A Subcollection represents a subset of the index. You can define URL patterns
that indicate whether a particular page (URL) is part of the Subcollection.
- Subcollection(String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
Public constructor.
- Subcollection(String, String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
Public constructor.
- Subcollection(Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
- SubcollectionIndexingFilter - Class in org.apache.nutch.indexer.subcollection
-
- SubcollectionIndexingFilter() - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SubcollectionIndexingFilter(Configuration) - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SUBJECT - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The topic of the content of the resource.
- SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing succeeded.
- SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Content was retrieved without errors.
- SUCCESS_REDIRECT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsed content contains a directive to redirect to another URL.
- SuffixStringMatcher - Class in org.apache.nutch.util
-
A class for efficiently matching Strings against a set of suffixes.
- SuffixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
-
Creates a new SuffixStringMatcher which will match Strings with any suffix
in the supplied array.
- SuffixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
-
Creates a new SuffixStringMatcher which will match Strings with any suffix
in the supplied Collection.
- SuffixURLFilter - Class in org.apache.nutch.urlfilter.suffix
-
Filters URLs based on a file of URL suffixes.
- SuffixURLFilter() - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SuffixURLFilter(Reader) - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SWFParser - Class in org.apache.nutch.parse.swf
-
Parser for Flash SWF files.
- SWFParser() - Constructor for class org.apache.nutch.parse.swf.SWFParser
-