Class BM25Similarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.BM25Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline
Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text
REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer
Constructor Summary
Constructors
BM25Similarity()
  BM25 with these default values: k1 = 1.2, b = 0.75, discountOverlaps = true.
BM25Similarity(boolean discountOverlaps)
  BM25 with these default values: k1 = 1.2, b = 0.75, and the supplied parameter value.
BM25Similarity(float k1, float b)
  BM25 with the supplied parameter values.
BM25Similarity(float k1, float b, boolean discountOverlaps)
  BM25 with the supplied parameter values.
Method Summary
Modifier and Type, Method, Description
protected float avgFieldLength(CollectionStatistics collectionStats)
  The default implementation computes the average as sumTotalTermFreq / docCount.
final float getB()
  Returns the b parameter.
final float getK1()
  Returns the k1 parameter.
protected float idf(long docFreq, long docCount)
  Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
  Computes a score factor for a simple term and returns an explanation for that score factor.
idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
  Computes a score factor for a phrase.
final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
  Compute any collection-level weight (e.g. IDF, average document length, etc.) needed for scoring a query.
toString()
Methods inherited from class org.apache.lucene.search.similarities.Similarity
computeNorm, getDiscountOverlaps
-
Constructor Details
-
BM25Similarity
public BM25Similarity(float k1, float b, boolean discountOverlaps)
BM25 with the supplied parameter values.
Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
Throws:
IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
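To make the roles of k1 and b concrete, here is a minimal sketch. It is not Lucene's code: the class `Bm25ParamsSketch` and its `tfNorm` method are hypothetical, and the formula tf / (tf + k1 * (1 - b + b * dl / avgdl)) is the textbook BM25 term-frequency component, assumed here because this page does not state it. The constructor mirrors the documented argument checks.

```java
/** Hypothetical illustration class; not part of Lucene. */
class Bm25ParamsSketch {
    private final float k1;
    private final float b;

    Bm25ParamsSketch(float k1, float b) {
        // Documented contract: k1 must be finite and non-negative.
        if (!Float.isFinite(k1) || k1 < 0) {
            throw new IllegalArgumentException("illegal k1 value: " + k1);
        }
        // Documented contract: b must lie within [0..1] (NaN is rejected too).
        if (Float.isNaN(b) || b < 0 || b > 1) {
            throw new IllegalArgumentException("illegal b value: " + b);
        }
        this.k1 = k1;
        this.b = b;
    }

    /**
     * Assumed textbook term-frequency component (no constant factors).
     * k1 sets how quickly repeated occurrences saturate;
     * b sets how strongly document length scales that saturation.
     */
    double tfNorm(double tf, double docLength, double avgDocLength) {
        if (tf <= 0) {
            return 0; // a term that does not occur contributes nothing
        }
        double lengthFactor = 1 - b + b * (docLength / avgDocLength);
        return tf / (tf + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        Bm25ParamsSketch s = new Bm25ParamsSketch(1.2f, 0.75f);
        System.out.println("tf=1,  average length: " + s.tfNorm(1, 100, 100));
        System.out.println("tf=10, average length: " + s.tfNorm(10, 100, 100));
        System.out.println("tf=1,  4x average:     " + s.tfNorm(1, 400, 100));
    }
}
```

Under this assumed form, b = 0 switches length normalization off entirely, and larger k1 values let additional occurrences keep adding to the score for longer before it flattens out.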
-
BM25Similarity
public BM25Similarity(float k1, float b)
BM25 with the supplied parameter values.
Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
Throws:
IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
-
BM25Similarity
public BM25Similarity(boolean discountOverlaps)
BM25 with these default values: k1 = 1.2, b = 0.75
Parameters:
discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
BM25Similarity
public BM25Similarity()
BM25 with these default values: k1 = 1.2, b = 0.75, discountOverlaps = true
-
-
Method Details
-
idf
protected float idf(long docFreq, long docCount)
Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
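The formula above can be sketched in plain Java as follows. The class name `Bm25IdfSketch` is hypothetical and the method is a standalone stand-in, not the protected Lucene method itself. It assumes `log` is the natural logarithm, as with `Math.log`.

```java
/** Hypothetical illustration class; not part of Lucene. */
class Bm25IdfSketch {

    /** log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), natural log assumed. */
    static float idf(long docFreq, long docCount) {
        // The 0.5D literals force floating-point division.
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        long docCount = 10;
        // Rare terms get a larger idf than common ones.
        System.out.println("docFreq=1:  " + idf(1, docCount));
        System.out.println("docFreq=5:  " + idf(5, docCount));
        System.out.println("docFreq=10: " + idf(10, docCount));
    }
}
```

Because of the `1 +` inside the logarithm, the result stays positive even when a term occurs in every document that has the field.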
avgFieldLength
protected float avgFieldLength(CollectionStatistics collectionStats)
The default implementation computes the average as sumTotalTermFreq / docCount.
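As a rough illustration of that average, the sketch below takes the two statistics as plain `long` values. The class `AvgFieldLengthSketch` and its parameters are hypothetical stand-ins for the values that `CollectionStatistics.sumTotalTermFreq()` and `CollectionStatistics.docCount()` would supply.

```java
/** Hypothetical illustration class; not part of Lucene. */
class AvgFieldLengthSketch {

    /** Average number of tokens per document that has the field. */
    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        // Cast before dividing so the quotient is not truncated to a whole number.
        return (float) (sumTotalTermFreq / (double) docCount);
    }

    public static void main(String[] args) {
        // 1000 tokens spread over 40 documents
        System.out.println(avgFieldLength(1000L, 40L));
    }
}
```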
idfExplain
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
Returns:
an Explain object that includes both an idf score factor and an explanation for the term.
-
idfExplain
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
Returns:
an Explain object that includes both an idf score factor for the phrase and an explanation for each term.
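The summation described for phrases can be sketched as below. This is only an illustration: `PhraseIdfSketch` and `phraseIdf` are hypothetical names, the per-term document frequencies are passed as a plain array rather than `TermStatistics` objects, and no explanation object is built.

```java
/** Hypothetical illustration class; not part of Lucene. */
class PhraseIdfSketch {

    /** Per-term idf, using the formula documented for idf(docFreq, docCount). */
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf factor of every term in the phrase. */
    static float phraseIdf(long[] docFreqs, long docCount) {
        double sum = 0d; // accumulate in double, narrow once at the end
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return (float) sum;
    }

    public static void main(String[] args) {
        // A two-term phrase: one rare term, one common term, in a 10-document collection.
        System.out.println(phraseIdf(new long[] {1, 5}, 10));
    }
}
```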
-
scorer
public final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
Description copied from class: Similarity
Compute any collection-level weight (e.g. IDF, average document length, etc.) needed for scoring a query.
Specified by:
scorer in class Similarity
Parameters:
boost - a multiplicative factor to apply to the produced scores
collectionStats - collection-level statistics, such as the number of tokens in the collection.
termStats - term-level statistics, such as the document frequency of a term across the collection.
Returns:
SimScorer object with the information this Similarity needs to score a query.
-
toString
-
getK1
public final float getK1()
Returns the k1 parameter.
-
getB
public final float getB()
Returns the b parameter.
-