Timeouts and index corruptions

Hi,

We have quite a few problems with our ES cluster here after upgrading from
1.2.1 to 1.4.1.
The upgrade was done quite brutally: the new version of ES was installed
and the cluster was restarted by our configuration management (SCM) tool.

First of all, after upgrading and restarting ES, multiple shards started
giving us errors about checksums and pre-existing corrupted indices.

The Lucene CheckIndex tool reported that those shards contained corrupted
data and rewrote the segments (with some document loss, which is expected).
The lost data was too important to us, though, so we decided to restore the
index from a snapshot, and that worked (yay!).
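
For reference, the restore itself was nothing fancy; roughly this, with
placeholder repository/snapshot/index names:

  # the index has to be closed (or deleted) before restoring over it
  curl -XPOST 'localhost:9200/my_index/_close'
  curl -XPOST 'localhost:9200/_snapshot/my_backup/snapshot_1/_restore' -d '{
    "indices" : "my_index"
  }'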

All went well, but a few hours later, when ES started getting a bit more
load, it began timing out on some queries.
The timeouts seem random with respect to query type.
Then the garbage collector started warning us on all nodes:
[2015-02-27 10:25:09,814][WARN ][monitor.jvm ] [node1]
[gc][old][83449][548] duration [27s], collections [2]/[13.3s], total
[27s]/[1.2h], memory [4.9gb]->[4.8gb]/[31.8gb], all_pools {[young]
[85.2mb]->[27.9mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[4.8gb]->[4.6gb]/[30.9gb]}
I don't know exactly how to read these, but it seems like something isn't
right, right?
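
For what it's worth, this is how we're watching the heap and GC counters
while it happens (just the stock node stats API):

  # per-node JVM stats: heap usage plus GC collection counts and total times
  curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'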

So after loading the snapshot and writing the delta back into the index, we
performed another snapshot.
That snapshot does get written and is reported as "OK" by ES (although it is
dreadfully slow to create), but when we try to restore an index from it, we
get corrupted shards again (duh!).
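
To be clear, by "OK" I mean what the snapshot APIs report back (placeholder
names again):

  # overall snapshot info: "state" comes back as SUCCESS, "failures" is empty
  curl -XGET 'localhost:9200/_snapshot/my_backup/snapshot_2?pretty'
  # per-shard detail while the snapshot runs (or afterwards)
  curl -XGET 'localhost:9200/_snapshot/my_backup/snapshot_2/_status?pretty'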

Sadly, it appears that our Java version doesn't match the recommendations
(Java 7u51).
Is there anything we should do to upgrade the cluster "properly"?

Also, the snapshots now take FOREVER to complete; is that normal?
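
In case it's relevant, our repository is registered more or less like this
(the type and path shown here are placeholders, ours may differ); I'm
wondering whether the per-node throttle (max_snapshot_bytes_per_sec, 40mb by
default if I understand correctly) could be part of the slowness:

  curl -XPUT 'localhost:9200/_snapshot/my_backup' -d '{
    "type" : "fs",
    "settings" : {
      "location" : "/mnt/backups/es",
      "max_snapshot_bytes_per_sec" : "40mb",
      "max_restore_bytes_per_sec" : "40mb"
    }
  }'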


Maybe this can help,

Since we restarted on the new version, with the same configured heap, the
memory consumption has changed A LOT.
See the memory graphs here.

The node stats show very little heap being used (around 15%), but a lot of
the other metrics seem off:
"mem" : {
"heap_used_in_bytes" : 4934366152,
"heap_used_percent" : 14,
"heap_committed_in_bytes" : 34246361088,
"heap_max_in_bytes" : 34246361088,
"non_heap_used_in_bytes" : 108007424,
"non_heap_committed_in_bytes" : 119799808,
"pools" : {
"young" : {
"used_in_bytes" : 500486864,
"max_in_bytes" : 907345920,
"peak_used_in_bytes" : 907345920,
"peak_max_in_bytes" : 907345920
},
"survivor" : {
"used_in_bytes" : 63272264,
"max_in_bytes" : 113377280,
"peak_used_in_bytes" : 113377280,
"peak_max_in_bytes" : 113377280
},
"old" : {
"used_in_bytes" : 4370607024,
"max_in_bytes" : 33225637888,
"peak_used_in_bytes" : 4443735936,
"peak_max_in_bytes" : 33225637888
}
}
},

Does that help?
