Class TrecDocParser
java.lang.Object
org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
- Direct Known Subclasses:
TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath
Parser for trec doc content, invoked on doc text excluding <DOC> and <DOCNO>, which
are handled in TrecContentSource. Implementations are required to be stateless and hence thread safe.
-
Nested Class Summary
Nested Classes
Modifier and Type: static enum
Class: TrecDocParser.ParsePathType
Description: Types of trec parse paths
-
Field Summary
Fields
Modifier and Type: static final TrecDocParser.ParsePathType
Field: DEFAULT_PATH_TYPE
Description: trec parser type used for unknown extensions
-
Constructor Summary
Constructors
Constructor: TrecDocParser()
-
Method Summary
Modifier and Type: static String
Method: extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Description: Extract from buf the text of interest within the specified tags.
Modifier and Type: abstract DocData
Method: parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
Description: Parse the text prepared in docBuf into a result DocData; no synchronization is required.
Modifier and Type: static TrecDocParser.ParsePathType
Method: pathType
Description: Compute the path type of a file by inspecting the name of the file and its parents.
Modifier and Type: static String
Method: stripTags(StringBuilder buf, int start)
Description: Strip tags from buf: each tag is replaced by a single blank.
Modifier and Type: static String
Method: stripTags
Description: Strip tags from input.
-
Field Details
-
DEFAULT_PATH_TYPE
trec parser type used for unknown extensions
-
-
Constructor Details
-
TrecDocParser
public TrecDocParser()
-
-
Method Details
-
pathType
Compute the path type of a file by inspecting the name of the file and its parents.
-
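The lookup described for pathType can be pictured with a small self-contained sketch. It is not the Lucene implementation: the PathKind enum, its constant names, the DEFAULT_KIND fallback value, and the case-insensitive name match are all assumptions made for illustration. It only shows the general idea of walking from a file up through its parents until some name maps to a known type, and falling back to a default otherwise.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PathTypeSketch {
    // Hypothetical stand-in for TrecDocParser.ParsePathType; constant names are assumed.
    public enum PathKind { GOV2, FBIS, FT, FR94, LATIMES }

    // Assumed fallback, playing the role of DEFAULT_PATH_TYPE; the value is an arbitrary choice.
    public static final PathKind DEFAULT_KIND = PathKind.GOV2;

    private static final Map<String, PathKind> BY_NAME = new HashMap<>();
    static {
        for (PathKind k : PathKind.values()) {
            BY_NAME.put(k.name(), k);
        }
    }

    // Walk from the file towards the root; the first path element whose
    // upper-cased name matches a known kind decides the result.
    public static PathKind pathKind(Path file) {
        for (Path p = file; p != null && p.getFileName() != null; p = p.getParent()) {
            PathKind k = BY_NAME.get(p.getFileName().toString().toUpperCase(Locale.ROOT));
            if (k != null) {
                return k;
            }
        }
        return DEFAULT_KIND;
    }

    public static void main(String[] args) {
        System.out.println(pathKind(Paths.get("data", "trec", "fbis", "fb396001"))); // FBIS
        System.out.println(pathKind(Paths.get("data", "misc", "file.txt")));         // GOV2 (fallback)
    }
}
```

Matching on whole path element names keeps the sketch simple; a real collection layout may need a different rule.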
parse
public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException
Parse the text prepared in docBuf into a result DocData; no synchronization is required.
- Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling trec content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown; may be used by parsers to alter their behavior according to the file path type.
- Throws:
IOException
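The "stateless, no synchronization required" contract can be illustrated without the Lucene classes. In the sketch below, Doc, Parser, and BodyParser are hypothetical stand-ins (the real parse method takes more arguments and fills a DocData); the point is only that a parser with no instance fields can be shared across threads as long as each caller passes in its own result object and buffer.

```java
import java.util.ArrayList;
import java.util.List;

public class StatelessParserSketch {
    // Hypothetical stand-in for DocData: a reusable result holder owned by the caller.
    public static class Doc {
        public String name;
        public String body;
    }

    // Hypothetical, reduced version of the abstract parse contract.
    public abstract static class Parser {
        public abstract Doc parse(Doc doc, String name, StringBuilder docBuf);
    }

    // No instance fields: everything arrives through the arguments, so one
    // instance can serve many threads without any locking.
    public static class BodyParser extends Parser {
        @Override
        public Doc parse(Doc doc, String name, StringBuilder docBuf) {
            doc.name = name;
            doc.body = docBuf.toString().replaceAll("<[^>]*>", " ").trim();
            return doc;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Parser shared = new BodyParser();
        String[] results = new String[4];
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < results.length; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                Doc doc = new Doc(); // each thread owns its result and its buffer
                StringBuilder buf = new StringBuilder("<TEXT>doc " + id + "</TEXT>");
                results[id] = shared.parse(doc, "doc-" + id, buf).body;
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        for (String r : results) {
            System.out.println(r); // doc 0 .. doc 3
        }
    }
}
```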
-
stripTags
Strip tags from buf: each tag is replaced by a single blank.
- Returns:
text obtained when stripping all tags from buf (input StringBuilder is unmodified).
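To make the "each tag is replaced by a single blank" behavior concrete, here is a minimal sketch of one way to get that effect. It is an illustration, not the library's code, and it rests on two assumptions: a tag is anything from '<' up to the next '>', and start is the offset at which processing begins, with earlier text dropped from the result.

```java
public class StripTagsSketch {
    // Assumption: a tag is '<', any run of non-'>' characters, then '>'.
    // Assumption: text before 'start' is not part of the result.
    public static String stripTags(StringBuilder buf, int start) {
        String text = buf.substring(Math.max(start, 0)); // copies; buf itself is never modified
        return text.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("<P>Hello <B>world</B></P>");
        System.out.println("[" + stripTags(buf, 0) + "]"); // [ Hello  world  ]
        System.out.println(buf);                           // <P>Hello <B>world</B></P>
    }
}
```

Because every tag becomes exactly one blank, adjacent tags leave runs of blanks; callers that want compact text would need to collapse whitespace themselves.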
-
stripTags
Strip tags from input.
- See Also:
-
extract
public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extract from buf the text of interest within the specified tags.
- Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0, sets a limit on start of text of interest
- Returns:
text of interest, or null if not found
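The extraction described above can be sketched in a few lines. This is a hedged illustration, not the library's implementation. The page does not document noisePrefixes, so the sketch assumes they are markers inside the tagged span whose leading text is skipped; it also assumes the result is trimmed and that maxPos bounds only the position of the start tag.

```java
public class ExtractSketch {
    public static String extract(StringBuilder buf, String startTag, String endTag,
                                 int maxPos, String[] noisePrefixes) {
        int from = buf.indexOf(startTag);
        // Not found, or found at or beyond the limit (a negative maxPos means "no limit").
        if (from < 0 || (maxPos >= 0 && from >= maxPos)) {
            return null;
        }
        from += startTag.length();
        int to = buf.indexOf(endTag, from);
        if (to < 0) {
            return null;
        }
        // Assumption: skip everything up to and including a noise marker inside the span.
        if (noisePrefixes != null) {
            for (String noise : noisePrefixes) {
                int n = buf.indexOf(noise, from);
                if (n >= 0 && n + noise.length() <= to) {
                    from = n + noise.length();
                }
            }
        }
        return buf.substring(from, to).trim(); // trimming is an assumption of this sketch
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder(
            "<DOC><DATE>Date: 1994-03-01</DATE><TEXT>body</TEXT></DOC>");
        System.out.println(extract(buf, "<DATE>", "</DATE>", -1, new String[] {"Date:"})); // 1994-03-01
        System.out.println(extract(buf, "<TEXT>", "</TEXT>", 10, null));                   // null
        System.out.println(extract(buf, "<HEADLINE>", "</HEADLINE>", -1, null));           // null
    }
}
```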
-