Read data from .fdx and .fdt files


(Hanish Bansal) #1

Hi All,

My use case is reindexing the data if an index gets corrupted. I am able
to read the data of a shard using Lucene's Java API (IndexReader), but
this API can read data only if the index is in a stable state.
If any file in the shard's index directory gets corrupted or deleted, the
shard goes to the UNASSIGNED state, and in this state I am not able to read
the data. Lucene also provides an API that checks and fixes a corrupt index,
but it simply removes the reference to the corrupt file's segment from
the index's segments file. That means loss of data, which we do not want.

I am storing all fields in Elasticsearch. Since all stored fields live only
in the .fdx and .fdt files, we should be able to read the data from those
files.
I referred to the links below but did not have much success.

http://pastebin.com/nmF0j4np#

http://search.cpan.org/~creamyg/KinoSearch-0.165/lib/KinoSearch/Docs/FileFormat.pod#Stored_fields

Somebody posted a tool to recover data from .fdt files, as mentioned in
https://issues.apache.org/jira/browse/LUCENE-4706, but it also cannot
read the data if any file in the shard is corrupt.

Please share your views.

--
Thanks & Regards
Hanish Bansal

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

The method ES supports for repairing shards is replication, i.e. setting a
replica level. Do you have a copy of the shard?

If not, you are correct: all that is left are damaged Lucene files and, afaik,
no tool support to reconstruct (parts of) the index.

Jörg



(Adrien Grand) #3

Hi,

I second Jörg on the fact that replicas are the best way to handle this
kind of situation.

In case you don't have replicas, you are correct: it is theoretically
possible to read stored fields alone. In addition to the .fdt and .fdx
files, you will also need the segment infos file (.si), to know the number
of documents in your segment, and the field infos file (.fnm). The latter
maps field names to numbers, and these numbers are used in the .fdt file to
refer to fields instead of their names.
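
To illustrate why the field infos file is needed, here is a toy sketch of the number-to-name lookup. It does not read Lucene's on-disk format; `FieldNumberDemo`, `StoredEntry` and `resolve` are hypothetical names made up for this example, and the in-memory map simply stands in for what the field infos file would provide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: this is NOT Lucene's file format or API.
final class FieldNumberDemo {

    // Hypothetical stand-in for one stored field as kept per document:
    // a field number plus its value, with no field name attached.
    static final class StoredEntry {
        final int fieldNumber;
        final String value;

        StoredEntry(int fieldNumber, String value) {
            this.fieldNumber = fieldNumber;
            this.value = value;
        }
    }

    // Turns (number, value) pairs into (name, value) pairs using the
    // number-to-name mapping that the field infos file would supply.
    static Map<String, String> resolve(List<StoredEntry> doc,
                                       Map<Integer, String> fieldInfos) {
        Map<String, String> named = new LinkedHashMap<>();
        for (StoredEntry entry : doc) {
            String name = fieldInfos.get(entry.fieldNumber);
            if (name == null) {
                // Without the mapping only the number is known.
                name = "<unknown field #" + entry.fieldNumber + ">";
            }
            named.put(name, entry.value);
        }
        return named;
    }

    public static void main(String[] args) {
        Map<Integer, String> fieldInfos = new HashMap<>();
        fieldInfos.put(0, "title");
        fieldInfos.put(1, "body");

        List<StoredEntry> doc = Arrays.asList(
                new StoredEntry(0, "Hello"),
                new StoredEntry(1, "Some text"),
                new StoredEntry(5, "orphaned value"));

        System.out.println(resolve(doc, fieldInfos));
    }
}
```

The third entry shows what is lost when the mapping is unavailable: the value can still be read, but there is no way to tell which field it belonged to.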

Here is a gist[1] which uses Lucene's codec API to read the stored fields
of an index even if other parts of the index are corrupted.

[1] https://gist.github.com/jpountz/6461246

--
Adrien Grand



(Hanish Bansal) #4

Thanks for your response!

I do not have any replicas.

Awesome, using this gist (https://gist.github.com/jpountz/6461246) I am able
to read data from .fdt files. So as long as my .fdt, .fdx, .fnm and .si files
are not corrupted, I can read all the data.
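
As a quick pre-check before attempting such a recovery, something like the sketch below could list which segments in a shard's index directory still have all four files. It is only a rough aid under several assumptions: `StoredFieldsFileCheck` and `missingBySegment` are names invented here, file names are assumed to look like `_0.fdt`, segments stored as compound files will show up as incomplete, and a file that is present may of course still be corrupt.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Presence check only: it does not validate file contents.
final class StoredFieldsFileCheck {

    // Extensions named in this thread as needed to read stored fields.
    static final List<String> REQUIRED = Arrays.asList("fdt", "fdx", "fnm", "si");

    // Maps each segment name (assumed to be the text before the last '.')
    // to the required extensions that are missing for it. Files with
    // other extensions, or no extension, are ignored.
    static Map<String, Set<String>> missingBySegment(Collection<String> fileNames) {
        Map<String, Set<String>> present = new TreeMap<>();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            if (dot <= 0) {
                continue;
            }
            String extension = name.substring(dot + 1);
            if (!REQUIRED.contains(extension)) {
                continue;
            }
            String segment = name.substring(0, dot);
            Set<String> found = present.get(segment);
            if (found == null) {
                found = new TreeSet<>();
                present.put(segment, found);
            }
            found.add(extension);
        }

        Map<String, Set<String>> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : present.entrySet()) {
            Set<String> absent = new TreeSet<>(REQUIRED);
            absent.removeAll(entry.getValue());
            missing.put(entry.getKey(), absent);
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Directory to inspect; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path path : stream) {
                names.add(path.getFileName().toString());
            }
        }
        for (Map.Entry<String, Set<String>> entry : missingBySegment(names).entrySet()) {
            String status = entry.getValue().isEmpty()
                    ? "all four files present"
                    : "missing " + entry.getValue();
            System.out.println(entry.getKey() + ": " + status);
        }
    }
}
```

Run it with the path of the shard's index directory as the only argument. Segments reported as complete are the candidates worth feeding to the gist.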

Thanks again, Adrien :)

--
Thanks & Regards
Hanish Bansal


