My use case is to reindex the data if an index gets corrupted. I am able
to read the data of a shard by using Lucene's Java API IndexReader, but
this API can only read the data if the index is in a stable state.
If any file in the shard's index directory gets corrupted or deleted, that
shard goes into the UNASSIGNED state. In this state I am not able to read the
data. Lucene also provides an API that checks and fixes the index if it is
corrupt, but that simply removes the reference to the corrupt file's segment
from the index's segments file. That means loss of data, which we do not want.
I am storing all fields in Elasticsearch. As all stored fields are kept only
in the .fdx and .fdt files, we should be able to read the data from the .fdx
and .fdt files.
I referred to the links below but did not have much success.
I second Jörg on the fact that replicas are the best way to handle this
kind of situation.
In case you don't have replicas, you are correct that it is theoretically
possible to read the stored fields alone. In addition to the .fdt and .fdx
files, you will also need the segment infos file, to know the number of
documents in your segment, and the field infos file: this file maps field
names to numbers, and these numbers are used in the .fdt file to refer to
fields instead of their names.
Here is a gist[1] which uses Lucene's codec API to read the stored fields
of an index even if other parts of the index are corrupted.
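For illustration, a rough sketch of that approach (assuming Lucene 5.x/6.x-era APIs, which differ slightly in other versions; the index path is a placeholder, and this ignores live docs and per-segment field-info updates) could look like this:

    import java.nio.file.Paths;

    import org.apache.lucene.codecs.StoredFieldsReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.DocumentStoredFieldVisitor;
    import org.apache.lucene.index.FieldInfos;
    import org.apache.lucene.index.SegmentCommitInfo;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;

    public class RecoverStoredFields {
        public static void main(String[] args) throws Exception {
            // Placeholder path to the shard's Lucene index directory.
            try (Directory dir = FSDirectory.open(Paths.get("/path/to/shard/index"))) {
                // The segment infos (segments_N) file lists the segments and
                // how many documents each one holds.
                SegmentInfos segmentInfos = SegmentInfos.readLatestCommit(dir);
                for (SegmentCommitInfo sci : segmentInfos) {
                    // The field infos file maps the field numbers used in the
                    // .fdt file back to field names.
                    FieldInfos fieldInfos = sci.info.getCodec().fieldInfosFormat()
                            .read(dir, sci.info, "", IOContext.READONCE);
                    // Open the stored fields (.fdx/.fdt) reader directly through
                    // the codec, bypassing postings, norms, doc values, etc.
                    try (StoredFieldsReader storedFields = sci.info.getCodec().storedFieldsFormat()
                            .fieldsReader(dir, sci.info, fieldInfos, IOContext.DEFAULT)) {
                        for (int docId = 0; docId < sci.info.maxDoc(); docId++) {
                            DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor();
                            storedFields.visitDocument(docId, visitor);
                            Document doc = visitor.getDocument();
                            // ... reindex the recovered fields. Deleted documents
                            // are included, since live docs are not consulted here.
                        }
                    }
                }
            }
        }
    }
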