Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):
```java
Analyzer analyzer = new StandardAnalyzer();

Path indexPath = Files.createTempDirectory("tempIndex");
Directory directory = FSDirectory.open(indexPath);
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
StoredFields storedFields = isearcher.storedFields();
for (int i = 0; i < hits.length; i++) {
  Document hitDoc = storedFields.document(hits[i].doc);
  assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
IOUtils.rm(indexPath);
```
The Lucene API is divided into several packages:

- `org.apache.lucene.analysis` defines an abstract `Analyzer` API for converting text from a `Reader` into a `TokenStream`, an enumeration of token `Attribute`s. A TokenStream can be composed by applying `TokenFilter`s to the output of a `Tokenizer`. Tokenizers and TokenFilters are strung together and applied with an `Analyzer`. analysis-common provides a number of Analyzer implementations, including StopAnalyzer and the grammar-based StandardAnalyzer.
- `org.apache.lucene.codecs` provides an abstraction over the encoding and decoding of the inverted index structure, as well as different implementations that can be chosen depending upon application needs.
- `org.apache.lucene.document` provides a simple `Document` class. A Document is simply a set of named `Field`s, whose values may be strings or instances of `Reader`.
- `org.apache.lucene.index` provides two primary classes: `IndexWriter`, which creates and adds documents to indices; and `IndexReader`, which accesses the data in the index.
- `org.apache.lucene.search` provides data structures to represent queries (e.g. `TermQuery` for individual words, `PhraseQuery` for phrases, and `BooleanQuery` for boolean combinations of queries) and the `IndexSearcher`, which turns queries into `TopDocs`. A number of QueryParsers are provided for producing query structures from strings or XML.
- `org.apache.lucene.store` defines an abstract class for storing persistent data, the `Directory`, which is a collection of named files written by an `IndexOutput` and read by an `IndexInput`. Multiple implementations are provided, but `FSDirectory` is generally recommended as it tries to use operating system disk buffer caches efficiently.
- `org.apache.lucene.util` contains a few handy data structures and util classes, e.g. `FixedBitSet` and `PriorityQueue`.
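
To make the analysis package description more concrete, here is a minimal sketch of running an `Analyzer` by hand and reading the terms from the resulting `TokenStream`. The class name `AnalysisSketch`, the `tokenize` helper, and the sample text are illustrative assumptions rather than part of Lucene, and the sketch assumes a recent Lucene core jar on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, for illustration only.
class AnalysisSketch {

  /** Runs the analyzer over the text and collects the terms it emits. */
  static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
    List<String> terms = new ArrayList<>();
    // The field name ("fieldname" here) only matters for analyzers that vary per field.
    try (TokenStream stream = analyzer.tokenStream("fieldname", text)) {
      // The attribute instance is reused; its value changes on each incrementToken().
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset(); // required before the first incrementToken()
      while (stream.incrementToken()) {
        terms.add(term.toString());
      }
      stream.end(); // finalizes end-of-stream state before close()
    }
    return terms;
  }

  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new StandardAnalyzer()) {
      System.out.println(tokenize(analyzer, "Lucene Indexes Text"));
    }
  }
}
```

Because StandardAnalyzer lowercases its tokens, the sample input should come back as lowercase terms. The same analyzer should be used at index time and at query time so that both sides produce matching terms.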

To use Lucene, an application should:

- Create `Document`s by adding `Field`s;
- Create an `IndexWriter` and add documents to it with `addDocument()`;
- Call `QueryParser.parse()` to build a query from a string; and
- Create an `IndexSearcher` and pass the query to its `search()` method.
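
The four steps above can be sketched as one self-contained class with its imports. This is only an illustration under stated assumptions: the class name `FourStepsSketch`, the `indexAndSearch` helper, the field name `body`, and the sample texts are all made up here, and an in-memory `ByteBuffersDirectory` is swapped in for a file-system directory so nothing is written to disk. It assumes the lucene-core and lucene-queryparser jars are on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, for illustration only.
class FourStepsSketch {

  /** Indexes each text as one document, then returns the stored text of the top hits. */
  static List<String> indexAndSearch(List<String> texts, String queryString)
      throws IOException, ParseException {
    List<String> results = new ArrayList<>();
    try (Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new ByteBuffersDirectory()) { // in-memory, assumption for the sketch
      // Steps 1 and 2: create Documents from Fields and add them with an IndexWriter.
      try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
        for (String text : texts) {
          Document doc = new Document();
          doc.add(new Field("body", text, TextField.TYPE_STORED));
          writer.addDocument(doc);
        }
      } // closing the writer commits the documents

      // Step 3: build a query from a string.
      Query query = new QueryParser("body", analyzer).parse(queryString);

      // Step 4: pass the query to an IndexSearcher.
      try (DirectoryReader reader = DirectoryReader.open(directory)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        StoredFields storedFields = searcher.storedFields();
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
          results.add(storedFields.document(hit.doc).get("body"));
        }
      }
    }
    return results;
  }

  public static void main(String[] args) throws Exception {
    List<String> texts = List.of("clam chowder recipe", "tomato soup recipe");
    System.out.println(indexAndSearch(texts, "chowder"));
  }
}
```

Try-with-resources is used here so that the writer, reader, directory, and analyzer are closed even if indexing or parsing throws.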

Two simple programs from the Lucene demo module follow these steps:

- IndexFiles.java creates an index for all the files contained in a directory.
- SearchFiles.java prompts for queries and searches an index.

To demonstrate these, try something like:

```
> java -cp lucene-core.jar:lucene-demo.jar:lucene-analysis-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
  [ ... ]

> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analysis-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
  [ ... thirty-four documents contain the word "chowder" ... ]

Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
  [ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
  [ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]
```
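
The canonical `+` form in the transcript corresponds to required clauses of a `BooleanQuery`. As a rough sketch, the second query could also be built programmatically instead of going through a query parser. The class name `BooleanQuerySketch`, the helper method, and the field name `contents` are assumptions for illustration; the terms are written in lowercase by hand because no analyzer runs when queries are constructed directly.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper class, for illustration only.
class BooleanQuerySketch {

  /** Builds the equivalent of: "clam chowder" AND manhattan on the given field. */
  static Query clamChowderAndManhattan(String field) {
    // Phrase clause: "clam" immediately followed by "chowder".
    Query phrase = new PhraseQuery(field, "clam", "chowder");
    // Single-term clause; already lowercased since no analyzer is applied here.
    Query term = new TermQuery(new Term(field, "manhattan"));
    // Occur.MUST is what the canonical "+" prefix denotes.
    return new BooleanQuery.Builder()
        .add(phrase, BooleanClause.Occur.MUST)
        .add(term, BooleanClause.Occur.MUST)
        .build();
  }

  public static void main(String[] args) {
    Query query = clamChowderAndManhattan("contents");
    // Passing the field name to toString() omits the "contents:" prefix on each clause.
    System.out.println(query.toString("contents"));
  }
}
```

Similarly, `BooleanClause.Occur.MUST_NOT` corresponds to the `-` prefix and `BooleanClause.Occur.SHOULD` to an optional clause.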

| Package | Description |
|---|---|
| org.apache.lucene.analysis | Text analysis. |
| org.apache.lucene.analysis.standard | Fast, general-purpose grammar-based tokenizer. `StandardTokenizer` implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. |
| org.apache.lucene.analysis.tokenattributes | General-purpose attributes for text analysis. |
| org.apache.lucene.codecs | Codecs API: API for customization of the encoding and structure of the index. |
| org.apache.lucene.codecs.compressing | Compressing helper classes. |
| org.apache.lucene.codecs.hnsw | HNSW vector helper classes. |
| org.apache.lucene.codecs.lucene90 | Lucene 9.0 file format. |
| org.apache.lucene.codecs.lucene90.blocktree | BlockTree terms dictionary. |
| org.apache.lucene.codecs.lucene90.compressing | Lucene 9.0 compressing format. |
| org.apache.lucene.codecs.lucene912 | Lucene 9.12 file format. |
| org.apache.lucene.codecs.lucene94 | Lucene 9.4 file format. |
| org.apache.lucene.codecs.lucene95 | Lucene 9.5 file format. |
| org.apache.lucene.codecs.lucene99 | Lucene 9.9 file format. |
| org.apache.lucene.codecs.perfield | Postings format that can delegate to different formats per-field. |
| org.apache.lucene.document | The logical representation of a `Document` for indexing and searching. |
| org.apache.lucene.geo | Geospatial utility implementations for Lucene Core. |
| org.apache.lucene.index | Code to maintain and access indices. |
| org.apache.lucene.internal.hppc | Internal copy of a subset of classes from the HPPC library. |
| org.apache.lucene.internal.tests | Internal bridges to package-private internals, for use by the Lucene test framework only. |
| org.apache.lucene.internal.vectorization | Internal implementations to support SIMD vectorization. |
| org.apache.lucene.search | Code to search indices. |
| org.apache.lucene.search.comparators | Comparators, used to compare hits so as to determine their sort order when collecting the top results with `TopFieldCollector`. |
| org.apache.lucene.search.knn | Classes related to vector search: knn and vector fields. |
| org.apache.lucene.search.similarities | This package contains the various ranking models that can be used in Lucene. |
| org.apache.lucene.store | Binary I/O API, used for all index data. |
| org.apache.lucene.util | Some utility classes. |
| org.apache.lucene.util.automaton | Finite-state automaton for regular expressions. |
| org.apache.lucene.util.bkd | Block KD-tree, implementing the generic spatial data structure described in this paper. |
| org.apache.lucene.util.compress | Compression utilities. |
| org.apache.lucene.util.fst | Finite state transducers. |
| org.apache.lucene.util.graph | Utility classes for working with token streams as graphs. |
| org.apache.lucene.util.hnsw | Navigable Small-World graph, nominally Hierarchical but currently only has a single layer. |
| org.apache.lucene.util.mutable | Comparable object wrappers. |
| org.apache.lucene.util.packed | Packed integer arrays and streams. |
| org.apache.lucene.util.quantization | Provides quantization methods for scaling vector values to smaller data types and possibly fewer dimensions. |