Class TrecDocParser
- java.lang.Object
  - org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
- Direct Known Subclasses:
TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath
public abstract class TrecDocParser extends Object
Parser for TREC document content. It is invoked on the document text excluding <DOC> and <DOCNO>, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread-safe.
-
Nested Class Summary
| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | TrecDocParser.ParsePathType | Types of trec parse paths. |
-
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static TrecDocParser.ParsePathType | DEFAULT_PATH_TYPE | trec parser type used for unknown extensions |
-
Constructor Summary
| Constructor | Description |
| --- | --- |
| TrecDocParser() | |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| static String | extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes) | Extract from buf the text of interest within specified tags. |
| abstract DocData | parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) | Parse the text prepared in docBuf into a result DocData; no synchronization is required. |
| static TrecDocParser.ParsePathType | pathType(Path f) | Compute the path type of a file by inspecting the name of the file and its parents. |
| static String | stripTags(StringBuilder buf, int start) | Strip tags from buf: each tag is replaced by a single blank. |
| static String | stripTags(String buf, int start) | Strip tags from input. |
-
Field Detail
-
DEFAULT_PATH_TYPE
public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
The trec parser type used for files with unknown extensions.
-
Method Detail
-
pathType
public static TrecDocParser.ParsePathType pathType(Path f)
Computes the path type of a file by inspecting the name of the file and the names of its parent directories.
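The lookup described above can be pictured as a walk from the file up through its parents. The following is a minimal sketch, not the library's code: `SketchPathType`, its three values, the `DEFAULT` fallback, and the case-insensitive name match are all assumptions made for illustration.

```java
import java.nio.file.Path;

// Hypothetical stand-in for TrecDocParser.ParsePathType; the values are illustrative only.
enum SketchPathType { GOV2, FBIS, FT }

final class PathTypeSketch {
  // Arbitrary fallback chosen for this sketch, playing the role of DEFAULT_PATH_TYPE.
  static final SketchPathType DEFAULT = SketchPathType.GOV2;

  // Walk from the file up through its parents; the first path element whose
  // name matches an enum constant (ignoring case) decides the type.
  static SketchPathType pathType(Path f) {
    for (Path p = f; p != null && p.getFileName() != null; p = p.getParent()) {
      String element = p.getFileName().toString();
      for (SketchPathType t : SketchPathType.values()) {
        if (t.name().equalsIgnoreCase(element)) {
          return t;
        }
      }
    }
    // Nothing along the path matched.
    return DEFAULT;
  }
}
```

Under these assumptions, a file stored as data/fbis/part1/doc.txt resolves to FBIS because of its grandparent directory, while a bare doc.txt falls back to the default.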
-
parse
public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException
Parses the text prepared in docBuf into a result DocData. No synchronization is required.
Parameters:
- docData - reusable result
- name - name that should be set to the result
- trecSrc - calling trec content source
- docBuf - text to parse
- pathType - type of parsed file, or null if unknown; may be used by parsers to alter their behavior according to the file path type
Throws:
- IOException
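To illustrate the contract (a reusable result is filled in and returned, and the parser keeps no state of its own), here is a self-contained sketch. It does not use the Lucene classes: `SketchDocData`, `SketchDocParser`, and `SketchHeadlineParser` are hypothetical stand-ins, the TrecContentSource and path type arguments are left out, and the `<HEADLINE>` tag is only an example.

```java
// Hypothetical stand-in for DocData, reduced to the fields this sketch fills in.
final class SketchDocData {
  String name;
  String title;
  String body;
}

// Hypothetical stand-in for the abstract parser, without the content source
// and path type arguments of the real method.
abstract class SketchDocParser {
  abstract SketchDocData parse(SketchDocData docData, String name, StringBuilder docBuf);
}

final class SketchHeadlineParser extends SketchDocParser {
  // No instance fields: all working state lives in locals, so one instance
  // can be shared by several threads without synchronization.
  @Override
  SketchDocData parse(SketchDocData docData, String name, StringBuilder docBuf) {
    String startTag = "<HEADLINE>";
    String endTag = "</HEADLINE>";
    int s = docBuf.indexOf(startTag);
    int e = (s < 0) ? -1 : docBuf.indexOf(endTag, s + startTag.length());
    docData.name = name;
    // Title is the headline text if both tags are present, otherwise null.
    docData.title = (e < 0) ? null : docBuf.substring(s + startTag.length(), e).trim();
    // Body is the whole text with every tag replaced by a blank.
    docData.body = docBuf.toString().replaceAll("<[^>]*>", " ").trim();
    return docData;
  }
}
```

The caller owns the result object and the buffer, which is what lets the parser itself stay free of fields.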
-
stripTags
public static String stripTags(StringBuilder buf, int start)
Strips tags from buf: each tag is replaced by a single blank.
Returns:
- the text obtained when stripping all tags from buf (the input StringBuilder is unmodified)
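The behavior described above can be approximated with a regular expression. This is a sketch under assumptions, not the library's implementation: it treats anything from `<` to the next `>` as a tag, and it assumes `start` is the offset at which processing begins, with earlier text dropped from the result. The class name `StripTagsSketch` is hypothetical.

```java
final class StripTagsSketch {
  // StringBuilder variant: substring() copies the characters, so the
  // caller's buffer is never modified.
  static String stripTags(StringBuilder buf, int start) {
    return stripTags(buf.substring(start), 0);
  }

  // String variant: replace each "<...>" run with a single blank.
  static String stripTags(String buf, int start) {
    return buf.substring(start).replaceAll("<[^>]*>", " ");
  }
}
```

Because each tag becomes exactly one blank, adjacent tags produce adjacent blanks; the sketch does not collapse whitespace.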
-
stripTags
public static String stripTags(String buf, int start)
Strips tags from input.
See Also:
- stripTags(StringBuilder, int)
-
extract
public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extracts from buf the text of interest within the specified tags.
Parameters:
- buf - entire input text
- startTag - tag marking start of text of interest
- endTag - tag marking end of text of interest
- maxPos - if ≥ 0, sets a limit on the start of the text of interest
Returns:
- text of interest, or null if not found
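A minimal sketch of the documented behavior follows. It is an illustration under assumptions rather than the library's code: the `noisePrefixes` argument is omitted because this page does not describe it, the start tag is required to begin strictly before `maxPos`, and the result is trimmed. The class name `ExtractSketch` is hypothetical.

```java
final class ExtractSketch {
  static String extract(StringBuilder buf, String startTag, String endTag, int maxPos) {
    int start = buf.indexOf(startTag);
    // No start tag, or (when maxPos >= 0) it begins at or beyond the limit.
    if (start < 0 || (maxPos >= 0 && start >= maxPos)) {
      return null;
    }
    start += startTag.length();
    int end = buf.indexOf(endTag, start);
    if (end < 0) {
      return null; // start tag without a matching end tag
    }
    // Trimming is a choice of this sketch, not something the page states.
    return buf.substring(start, end).trim();
  }
}
```

A negative `maxPos` disables the limit, so the whole buffer is searched for the start tag.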