public class UpdateIndex
extends Object
A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance. This class contains the main() method which uses a
Map/Reduce job to analyze documents and update Lucene instances in parallel.
The main() method in UpdateIndex requires the following information for
updating the shards:
- Input formatter. This specifies how to format the input documents.
- Analysis. This defines the analyzer to use on the input. The analyzer
determines whether a document is being inserted, updated, or deleted.
For inserts or updates, the analyzer also converts each input document
into a Lucene document.
- Input paths. This provides the location(s) of updated documents,
e.g., HDFS files or directories, or HBase tables.
- Shard paths, or an index path with the number of shards. Either specify
the path of each shard, or specify an index path and the number of shards,
in which case the shards are sub-directories of the index directory.
- Output path. When the update to a shard is done, a message is written here.
- Number of map tasks.
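The second way of naming shards, an index path plus a shard count, can be
sketched as follows. This is a minimal illustration, not the code that
UpdateIndex itself runs: the class name ShardPathSketch, the method
shardPaths, and the zero-padded five-digit sub-directory names are all
assumptions made for the example. The description above only states that
the shards are sub-directories of the index directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShardPathSketch {

    // Hypothetical helper: derive one sub-directory path per shard
    // from an index path and a shard count.
    public static List<String> shardPaths(String indexPath, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // Avoid a doubled separator when the index path already ends in '/'.
        String parent = indexPath.endsWith("/") ? indexPath : indexPath + "/";
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            // Assumed naming scheme: 00000, 00001, ...
            paths.add(parent + String.format(Locale.ROOT, "%05d", i));
        }
        return paths;
    }

    public static void main(String[] args) {
        // "/user/example/index" is a made-up path for illustration.
        for (String path : shardPaths("/user/example/index", 3)) {
            System.out.println(path);
        }
    }
}
```

Listing each shard path explicitly is the alternative when the shards do
not share a common parent directory.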
All of this information can be specified in a configuration file. All but
the first two items can also be specified as command-line options. See
conf/index-config.xml.template for other configurable parameters.
Note: Because of the parallel nature of Map/Reduce, the behaviour of
multiple inserts, deletes, or updates to the same document is undefined.
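To see why the outcome is undefined, consider one insert and one delete for
the same document arriving at a shard in either order. The sketch below
models a shard as a plain in-memory map rather than a Lucene instance;
OrderingSketch, Op, Kind, and apply are hypothetical names used only for
illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrderingSketch {

    public enum Kind { INSERT, DELETE }

    // Hypothetical stand-in for one analyzed input document.
    public static final class Op {
        final Kind kind;
        final String docId;
        final String body;

        public Op(Kind kind, String docId, String body) {
            this.kind = kind;
            this.docId = docId;
            this.body = body;
        }
    }

    // Apply the operations, in the given order, to an empty stand-in shard.
    public static Map<String, String> apply(List<Op> ops) {
        Map<String, String> shard = new HashMap<>();
        for (Op op : ops) {
            if (op.kind == Kind.INSERT) {
                shard.put(op.docId, op.body);
            } else {
                shard.remove(op.docId);
            }
        }
        return shard;
    }

    public static void main(String[] args) {
        Op insert = new Op(Kind.INSERT, "doc-1", "hello");
        Op delete = new Op(Kind.DELETE, "doc-1", null);

        // Insert first, then delete: the document is gone.
        System.out.println(apply(Arrays.asList(insert, delete)).containsKey("doc-1")); // false
        // Delete first, then insert: the document survives.
        System.out.println(apply(Arrays.asList(delete, insert)).containsKey("doc-1")); // true
    }
}
```

Because parallel tasks give no guarantee about which order is taken, a safe
approach is to make sure each document appears at most once in the input of
a single run.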