Indexes seem corrupted

John_Chang · November 17, 2010, 5:33pm

We are worried our indexes are corrupted, for a number of reasons. We are looking through the logs to see what might have happened, but still don't have a grasp on it. Any advice on understanding, troubleshooting, and preventing what we are seeing would be greatly appreciated. Thanks.

We keep 4 document types; they all used to have the desired mappings, but now 2 of the 4 seem to be missing them. Our system maps all 4 types at once, and we are confident those mappings used to be there for all types.
We lost a lot of documents; when we do a count, only a fraction remains of what used to be there.
We are getting an error we've never seen before (see below). The document type in question here does still seem to have the correct mappings.

[Failed to execute main query]]; nested: CompileException[[Error: Invalid shift value in prefixCoded string (is encoded value really an INT?)]\n[Near : {... Unknown ....}]\n ^\n[Line: 1, Column: 0]]; nested: NumberFormatException[Invalid shift value in prefixCoded string (is encoded value really an INT?)]; }{[6fe786b4-de13-451c-8296-7803b8bbe1d8][index0][2]: RemoteTransportException[[Angel][inet[/10.198.109.171:9300]][search/phase/query]]; nested: QueryPhaseExecutionException[[index0][2]: query[custom score (+userId:4c6b25774f8bd5147ab46cf4 +(body:"john smith" subject:"john smith" to:"john smith" from:"john smith" cc:"john smith"),function=org.elasticsearch.index.query.xcontent.CustomScoreQueryParser$ScriptScoreFunction@7daf32e3)],from[0],size[100]: Query Failed

John_Chang · November 17, 2010, 5:43pm

I should add that the index was created on Elastic Search 0.11 and we upgraded to 0.12.1, without reindexing (which we understood to be not necessary as we are not doing geo searches). We tested it after the upgrade and it seemed fine then; not sure when it went off the rails.

I don't expect this to have anything to do with the upgrade, but I wanted to call it out in case it is useful info.

Clinton_Gormley · November 17, 2010, 6:01pm

Hi John

On Wed, 2010-11-17 at 09:43 -0800, John Chang wrote:

I should add that the index was created on Elastic Search 0.11 and we
upgraded to 0.12.1, without reindexing (which we understood to be not
necessary as we are not doing geo searches). We tested it after the upgrade
and it seemed fine then; not sure when it went off the rails.

Not expecting this has to do with the upgrade, but just wanted to call it
out just in case it was useful info.

This does sound like your indexes have been corrupted somewhere along
the way. You may have been hit by this bug:

github.com/elastic/elasticsearch

Possible (rare) shard index corruption / different doc count on recovery (gateway / shard)

opened 08:58PM - 01 Nov 10 UTC

closed 04:00AM - 02 Nov 10 UTC

kimchy

>bug v0.13.0

The problem stems from the reusing of existing index files when doing recovery. …Checksums should be added, but, in order to have it performant, they are computed on write. This means that existing indices will still work, but might suffer from it. New internal index files will get checksummed, and eventually the index will be fully checksummed, though it is recommended to reindex the data.

Although I'm not sure if that would result in you losing mappings.

Would be worth gist'ing your logs: https://gist.github.com/

clint

John_Chang · November 17, 2010, 8:23pm

Here is a gist of the Elastic Search logs. However, I don't know if they will be useful; they just log some activity about 2 hours before I started seeing the problems noted above in my application logs, and they seem pretty tame:

https://gist.github.com/anonymous/703964

elastic-node1-log

[22:24:25,337][INFO ][node                     ] [Kogar] {elasticsearch/0.11.0}[9433]: initializing ...
[22:24:25,576][INFO ][plugins                  ] [Kogar] loaded [mapper-attachments, cloud-aws]
[22:24:27,729][INFO ][node                     ] [Kogar] {elasticsearch/0.11.0}[9433]: initialized
[22:24:27,729][INFO ][node                     ] [Kogar] {elasticsearch/0.11.0}[9433]: starting ...
[22:24:27,815][INFO ][transport                ] [Kogar] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.206.98.255:9300]}
[22:24:31,058][INFO ][cluster.service          ] [Kogar] new_master [Kogar][a4491956-42e1-4f7a-bd7a-f61af415acad][inet[/10.206.98.255:9300]], reason: zen-disco-join (elected_as_master)
[22:24:31,061][INFO ][discovery                ] [Kogar] dev5/a4491956-42e1-4f7a-bd7a-f61af415acad
[22:24:31,065][INFO ][http                     ] [Kogar] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.206.98.255:9200]}
[22:24:31,065][INFO ][node                     ] [Kogar] {elasticsearch/0.11.0}[9433]: started

This file has been truncated.

elastic-node2-log

[22:26:23,157][INFO ][node                     ] [Angel] {elasticsearch/0.11.0}[9410]: initializing ...
[22:26:23,443][INFO ][plugins                  ] [Angel] loaded [mapper-attachments, cloud-aws]
[22:26:26,052][INFO ][node                     ] [Angel] {elasticsearch/0.11.0}[9410]: initialized
[22:26:26,052][INFO ][node                     ] [Angel] {elasticsearch/0.11.0}[9410]: starting ...
[22:26:26,126][INFO ][transport                ] [Angel] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.198.109.171:9300]}
[22:26:29,433][INFO ][cluster.service          ] [Angel] detected_master [Kogar][a4491956-42e1-4f7a-bd7a-f61af415acad][inet[/10.206.98.255:9300]], added {[Senor Muerte][03ada34f-1062-4729-949f-d8048093b46c][inet[/10.244.255.242:9304]]{client=true, data=false},[Kogar][a4491956-42e1-4f7a-bd7a-f61af415acad][inet[/10.206.98.255:9300]],[Vashti][635ac3e4-00de-40db-a84b-a6a5cbefc7a9][inet[/10.244.255.242:9306]]{client=true, data=false},[Dreamqueen][4173bf44-2bed-4b2a-9a6a-d9ac09c2bb38][inet[/10.244.255.242:9308]]{client=true, data=false},[Master of Vengeance][b8832488-3c29-46dd-bc13-98101b13352b][inet[/10.244.255.242:9305]]{client=true, data=false},[Alex][f83006d1-a053-4562-9482-46b25698122a][inet[/10.244.255.242:9310]]{client=true, data=false},[Jameson, Dr. Marla][bbbf5072-47d0-47

This file has been truncated.

Here is some more info from my application log. It is basically more of what I put in the original post:

https://gist.github.com/anonymous/704009

elastic-dataonly-node-log

Error Message: org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], total failure; shardFailures {[a4491956-42e1-4f7a-bd7a-f61af415acad][index0][0]: RemoteTransportException[[Kogar][inet[/10.206.98.255:9300]][search/phase/query]]; nested: QueryPhaseExecutionException[[index0][0]: query[custom score (+userId:4c6aeebb4f8bd51475b46cf4 +(body:"john smith" subject:"john smith" to:"john smith" from:"john smith" cc:"john smith"),function=org.elasticsearch.index.query.xcontent.CustomScoreQueryParser$ScriptScoreFunction@53ee3e86)],from[0],size[100]: Query Failed [Failed to execute main query]]; nested: CompileException[[Error: Invalid shift value in prefixCoded string (is encoded value really an INT?)]
[Near : {... Unknown ....}]
             ^
[Line: 1, Column: 0]]; nested: NumberFormatException[Invalid shift value in prefixCoded string (is encoded value really an INT?)]; }{[a4491956-42e1-4f7a-bd7a-f61af415acad][index0][1]: RemoteTransportException[[Kogar][inet[/10.206.98.255:9300]][search/phase/query]]; nested: QueryPhaseExecutionException[[index0][1]: query[custom score (+userId:4c6aeebb4f8bd51475b46cf4 +(body:"john smith" subject:"john smith" to:"john smith" from:"john smith" cc:"john smith"),function=org.elasticsearch.index.query.xcontent.CustomScoreQueryParser$ScriptScoreFunction@1e5f3583)],from[0],size[100]: Query Failed [Failed to execute main query]]; nested: CompileException[[Error: Invalid shift value in prefixCoded string (is encoded value really an INT?)]
[Near : {... Unknown ....}]

This file has been truncated.

I don't know if this is useful, but I can't think of anything more to post. Let me know if there's something else that I'm missing.

kimchy · November 17, 2010, 8:57pm

It might be related to the possible corruption that was fixed in master
(upcoming 0.13). I also fixed a possible race condition between the
recovery of an index and the creation of its mappings, where an index
operation could get in between the two (the fix is the new full cluster
and index level blocks). It sounds like you might have hit both of them.
I assume you use the local gateway?

-shay.banon

John_Chang · November 17, 2010, 10:29pm

I think that's the problem. Yes, we are using local search. Also, what you (kimchy) write makes sense, as the Elastic Search data node logs here https://gist.github.com/703964 show initialization at times that correspond perfectly to when the searches started going bad in our application log (which uses the no-data nodes).

The only thing I wonder is...why did the Elastic Search data nodes decide to reinitialize at that time; we did restart the data node cluster, but that was over 2 hours before this initialization in those logs. What kicks off the initialization other than a service restart?

kimchy · November 17, 2010, 10:35pm

It seems like the network connection between the nodes got completely
broken (you can see the transport disconnect reason given for the nodes
being identified as failed).

You can try setting discovery.zen.fd.connect_on_network_disconnect to true.
In such an event, the node will then try to connect to the node in question
again, to make sure it really can't be reached.
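
As a minimal sketch, assuming each node reads its settings from config/elasticsearch.yml (the file location is an assumption; adjust it to your install), the setting could be written like this. File-based settings are read at startup, so the node would likely need a restart to pick it up:

```yaml
# elasticsearch.yml (assumed location: config/elasticsearch.yml)
# On a transport disconnect, try to reconnect to the node in question
# before treating it as failed.
discovery.zen.fd.connect_on_network_disconnect: true
```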

-shay.banon
