Elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this

Our production cluster went yellow last night after our logstash index
rolled over to the next version. I've seen this happen before but this time
I decided to properly diagnose and seek some feedback on what might be
going on.

So, I'd love some feedback on what is going on. I'm happy to keep this
cluster in a yellow state for a limited time to get some help from people
in this group trying to diagnose this properly and maybe help some others
who face the same issues. However, I will need to fix this one way or
another before end of business day today. I plan to perform a rolling
restart to see if node reinitialization fixes things. If not, I'll remove
the problematic logstash index and move on. I'd love suggesttions for less
intrusive solutions. I don't like losing data and rolling restarts are kind
of tedious to babysit. Tends to take 45 minutes or so.

Below is some information I've gathered. Let me know if you need me to
extract more data.

First the obvious:

{
"status" : 200,
"name" : "192.168.1.13",
"cluster_name" : "linko_elasticsearch",
"version" : {
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

[linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 221,
"active_shards" : 619,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So we're yellow and the reason is initializing and unassigned shards. We
have five nodes, of which three are data nodes. It seems we are hitting
some kind of resilience issue. The three machines have plenty of diskspace
and memory.

I found this in the log of one of our es nodes:
[2014-11-27 10:15:12,585][WARN ][cluster.action.shard ] [192.168.1.13]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message
[RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from
[192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/192.168.1.14:9300]]
into
[192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}];
nested:
RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
failed]; nested:
RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
nested: NumberFormatException[For input string: "finished"]; ]]

on the mentioned node there's a corresponding messages:
[2014-11-27 10:17:54,187][WARN ][cluster.action.shard ] [192.168.1.14]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform
[indices:data/write/bulk[s]] on replica, message
[RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
nested: NumberFormatException[For input string: "finished"]; ]]

All three data nodes have similar messages happening over and over again.

Our cluster has been up for a couple of weeks and seems pretty happy
otherwise. I deleted some older logstash indices a few days ago. The
cluster has logstash data and a few smallish indiceses we use for our
inbot.io service. The issue appears to be related to the logstash index
rollover. Our app servers and kibana talk to the two non data nodes that we
run on both our application servers.

My next stop was kibana which we use on the same cluster with the logstash
index that is probably causing us issues. Looking at that, I noticed a few
interesting things:

  • logstash indexing seems to be fine (good) and it appears there has
    been no data loss yet
  • our cpu load jumped around midnight and sort of stayed up on all three
    nodes. We measure this using collectd and both mean and max load jumped to
    around 1 around the time the index rollover happened.

My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'

All the indices listed there looked fine. I'll spare you the output but
everything appared to be in the 'done' stage.

Finally, I did
[linko@es3 elasticsearch]$ curl -XGET
'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5,
"active_shards" : 12,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So, that confirms last night's new logstash index is the issue and it seems
to have issues initializing two of its shards. With some Googling, I also
found a recently closed issues
(e.g. https://github.com/elasticsearch/elasticsearch/issues/6808) that seem
related to this for 1.3.x. However, I'm on 1.4 where this is supposedly
fixed. The NumberFormatException mentioned in the logs seems suspicious for
sure.

My questions are: what is going on, how do I fix it properly and without
data loss or downtime, and I do I stop this from happening? A rolling
restart might fix things (seen that happen before).

Maybe there are some magic curl incantation than nudges things along that
I've overlooked? I've dug around on google but all I seem to find is advice
along the lines of deleting indices and/or maybe trying this or that by
poking around with various apis in the elasticsearch internals. I'm looking
for something a bit more responsible/structured and am a bit wary of the
'shotgun' approach here. Yes, I know I can fix things by deleting stuff or
restarting stuff and doing a hail mary. However, my main concern is that
the need for this seems to be a regularly recurring thing (done it a dozen
times over the course of the last year for different issues). What I
really need is simple instructions that I can give to our people to fix
things when things go bad: when x happens do this and you'll be fine. The
last thing I want is ops people who get in the habit of deleting indices or
doing random rolling restarts out of magical thinking. That sounds like
something that could cause more issues than that it solves. I know from
experience a botched rolling restart is not nice to deal with.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This looks like a mapping issue to me (not 100% sure). A document that is
in the translog has a string field (with value: 'finished'), but it is
mapped as a number field (long, integer, double, etc.) in the mapping. This
causes the number format exception that you're seeing in your logs when
that document is indexed from the translog as part of the recovery and this
then prevents the shard from getting started.

These problems can occur when new fields are introduced at index time and
also when numeric_detection is enabled in the mapping (which makes these
errors more likely). Is this the case in your ES setup?

Can you also check the mappings of the logstash-2014.11.27 index and see
what fields can possible contain 'finished'? Unfortunately the field name
didn't get included with your errors.

On 27 November 2014 at 11:19, Jilles van Gurp jillesvangurp@gmail.com
wrote:

Our production cluster went yellow last night after our logstash index
rolled over to the next version. I've seen this happen before but this time
I decided to properly diagnose and seek some feedback on what might be
going on.

So, I'd love some feedback on what is going on. I'm happy to keep this
cluster in a yellow state for a limited time to get some help from people
in this group trying to diagnose this properly and maybe help some others
who face the same issues. However, I will need to fix this one way or
another before end of business day today. I plan to perform a rolling
restart to see if node reinitialization fixes things. If not, I'll remove
the problematic logstash index and move on. I'd love suggesttions for less
intrusive solutions. I don't like losing data and rolling restarts are kind
of tedious to babysit. Tends to take 45 minutes or so.

Below is some information I've gathered. Let me know if you need me to
extract more data.

First the obvious:

{
"status" : 200,
"name" : "192.168.1.13",
"cluster_name" : "linko_elasticsearch",
"version" : {
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

[linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 221,
"active_shards" : 619,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So we're yellow and the reason is initializing and unassigned shards. We
have five nodes, of which three are data nodes. It seems we are hitting
some kind of resilience issue. The three machines have plenty of diskspace
and memory.

I found this in the log of one of our es nodes:
[2014-11-27 10:15:12,585][WARN ][cluster.action.shard ] [192.168.1.13]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message
[RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from
[192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/
192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][
es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested:
RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
failed]; nested:
RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
nested: NumberFormatException[For input string: "finished"]; ]]

on the mentioned node there's a corresponding messages:
[2014-11-27 10:17:54,187][WARN ][cluster.action.shard ] [192.168.1.14]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform
[indices:data/write/bulk[s]] on replica, message
[RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
nested: NumberFormatException[For input string: "finished"]; ]]

All three data nodes have similar messages happening over and over again.

Our cluster has been up for a couple of weeks and seems pretty happy
otherwise. I deleted some older logstash indices a few days ago. The
cluster has logstash data and a few smallish indiceses we use for our
inbot.io service. The issue appears to be related to the logstash index
rollover. Our app servers and kibana talk to the two non data nodes that we
run on both our application servers.

My next stop was kibana which we use on the same cluster with the logstash
index that is probably causing us issues. Looking at that, I noticed a few
interesting things:

  • logstash indexing seems to be fine (good) and it appears there has
    been no data loss yet
  • our cpu load jumped around midnight and sort of stayed up on all
    three nodes. We measure this using collectd and both mean and max load
    jumped to around 1 around the time the index rollover happened.

My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'

All the indices listed there looked fine. I'll spare you the output but
everything appared to be in the 'done' stage.

Finally, I did
[linko@es3 elasticsearch]$ curl -XGET
'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5,
"active_shards" : 12,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So, that confirms last night's new logstash index is the issue and it
seems to have issues initializing two of its shards. With some Googling,
I also found a recently closed issues (e.g.
https://github.com/elasticsearch/elasticsearch/issues/6808) that seem
related to this for 1.3.x. However, I'm on 1.4 where this is supposedly
fixed. The NumberFormatException mentioned in the logs seems suspicious for
sure.

My questions are: what is going on, how do I fix it properly and without
data loss or downtime, and I do I stop this from happening? A rolling
restart might fix things (seen that happen before).

Maybe there are some magic curl incantation than nudges things along that
I've overlooked? I've dug around on google but all I seem to find is advice
along the lines of deleting indices and/or maybe trying this or that by
poking around with various apis in the elasticsearch internals. I'm
looking for something a bit more responsible/structured and am a bit wary
of the 'shotgun' approach here. Yes, I know I can fix things by deleting
stuff or restarting stuff and doing a hail mary. However, my main concern
is that the need for this seems to be a regularly recurring thing (done it
a dozen times over the course of the last year for different issues). What
I really need is simple instructions that I can give to our people to fix
things when things go bad: when x happens do this and you'll be fine. The
last thing I want is ops people who get in the habit of deleting indices or
doing random rolling restarts out of magical thinking. That sounds like
something that could cause more issues than that it solves. I know from
experience a botched rolling restart is not nice to deal with.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CA%2BA76TzBQt5i1_jFwBs_b5DPK2F0tsdo3UV4RgtQMJwh33B6yQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Thanks for the explanation. I suspect many logstash users might be running
into this one since you typically use a dynamic mapping with that. We have
some idea where this is happening though and we can probably fix it
properly. This happened during index roll over and we indeed are indexing a
lot of things via logstash almost continuously.

Jilles

On Thursday, November 27, 2014 4:06:21 PM UTC+1, Martijn v Groningen wrote:

This looks like a mapping issue to me (not 100% sure). A document that is
in the translog has a string field (with value: 'finished'), but it is
mapped as a number field (long, integer, double, etc.) in the mapping. This
causes the number format exception that you're seeing in your logs when
that document is indexed from the translog as part of the recovery and this
then prevents the shard from getting started.

These problems can occur when new fields are introduced at index time and
also when numeric_detection is enabled in the mapping (which makes these
errors more likely). Is this the case in your ES setup?

Can you also check the mappings of the logstash-2014.11.27 index and see
what fields can possible contain 'finished'? Unfortunately the field name
didn't get included with your errors.

On 27 November 2014 at 11:19, Jilles van Gurp <jilles...@gmail.com
<javascript:>> wrote:

Our production cluster went yellow last night after our logstash index
rolled over to the next version. I've seen this happen before but this time
I decided to properly diagnose and seek some feedback on what might be
going on.

So, I'd love some feedback on what is going on. I'm happy to keep this
cluster in a yellow state for a limited time to get some help from people
in this group trying to diagnose this properly and maybe help some others
who face the same issues. However, I will need to fix this one way or
another before end of business day today. I plan to perform a rolling
restart to see if node reinitialization fixes things. If not, I'll remove
the problematic logstash index and move on. I'd love suggesttions for less
intrusive solutions. I don't like losing data and rolling restarts are kind
of tedious to babysit. Tends to take 45 minutes or so.

Below is some information I've gathered. Let me know if you need me to
extract more data.

First the obvious:

{
"status" : 200,
"name" : "192.168.1.13",
"cluster_name" : "linko_elasticsearch",
"version" : {
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

[linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 221,
"active_shards" : 619,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So we're yellow and the reason is initializing and unassigned shards. We
have five nodes, of which three are data nodes. It seems we are hitting
some kind of resilience issue. The three machines have plenty of diskspace
and memory.

I found this in the log of one of our es nodes:
[2014-11-27 10:15:12,585][WARN ][cluster.action.shard ]
[192.168.1.13] [logstash-2014.11.27][4] sending failed shard for
[logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R],
s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to
start shard, message [RecoveryFailedException[[logstash-2014.11.27][4]:
Recovery failed from [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io
][inet[/192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][
es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested:
RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
failed]; nested:
RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
nested: NumberFormatException[For input string: "finished"]; ]]

on the mentioned node there's a corresponding messages:
[2014-11-27 10:17:54,187][WARN ][cluster.action.shard ]
[192.168.1.14] [logstash-2014.11.27][4] sending failed shard for
[logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R],
s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to
perform [indices:data/write/bulk[s]] on replica, message
[RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
nested: NumberFormatException[For input string: "finished"]; ]]

All three data nodes have similar messages happening over and over again.

Our cluster has been up for a couple of weeks and seems pretty happy
otherwise. I deleted some older logstash indices a few days ago. The
cluster has logstash data and a few smallish indiceses we use for our
inbot.io service. The issue appears to be related to the logstash index
rollover. Our app servers and kibana talk to the two non data nodes that we
run on both our application servers.

My next stop was kibana which we use on the same cluster with the
logstash index that is probably causing us issues. Looking at that, I
noticed a few interesting things:

  • logstash indexing seems to be fine (good) and it appears there has
    been no data loss yet
  • our cpu load jumped around midnight and sort of stayed up on all
    three nodes. We measure this using collectd and both mean and max load
    jumped to around 1 around the time the index rollover happened.

My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'

All the indices listed there looked fine. I'll spare you the output but
everything appared to be in the 'done' stage.

Finally, I did
[linko@es3 elasticsearch]$ curl -XGET
'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5,
"active_shards" : 12,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So, that confirms last night's new logstash index is the issue and it
seems to have issues initializing two of its shards. With some
Googling, I also found a recently closed issues (e.g.
https://github.com/elasticsearch/elasticsearch/issues/6808) that seem
related to this for 1.3.x. However, I'm on 1.4 where this is supposedly
fixed. The NumberFormatException mentioned in the logs seems suspicious for
sure.

My questions are: what is going on, how do I fix it properly and without
data loss or downtime, and I do I stop this from happening? A rolling
restart might fix things (seen that happen before).

Maybe there are some magic curl incantation than nudges things along that
I've overlooked? I've dug around on google but all I seem to find is advice
along the lines of deleting indices and/or maybe trying this or that by
poking around with various apis in the elasticsearch internals. I'm
looking for something a bit more responsible/structured and am a bit wary
of the 'shotgun' approach here. Yes, I know I can fix things by deleting
stuff or restarting stuff and doing a hail mary. However, my main concern
is that the need for this seems to be a regularly recurring thing (done it
a dozen times over the course of the last year for different issues). What
I really need is simple instructions that I can give to our people to fix
things when things go bad: when x happens do this and you'll be fine. The
last thing I want is ops people who get in the habit of deleting indices or
doing random rolling restarts out of magical thinking. That sounds like
something that could cause more issues than that it solves. I know from
experience a botched rolling restart is not nice to deal with.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cafa7488-72c2-4d95-b8b5-996c04757a41%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

If the field you suspect causing this is a string field in the mapping then
you can try to close and open the index. This will then sync the in-memory
representation of the mapping with what is in the cluster state.

On 27 November 2014 at 16:49, Jilles van Gurp jillesvangurp@gmail.com
wrote:

Thanks for the explanation. I suspect many logstash users might be running
into this one since you typically use a dynamic mapping with that. We have
some idea where this is happening though and we can probably fix it
properly. This happened during index roll over and we indeed are indexing a
lot of things via logstash almost continuously.

Jilles

On Thursday, November 27, 2014 4:06:21 PM UTC+1, Martijn v Groningen wrote:

This looks like a mapping issue to me (not 100% sure). A document that is
in the translog has a string field (with value: 'finished'), but it is
mapped as a number field (long, integer, double, etc.) in the mapping. This
causes the number format exception that you're seeing in your logs when
that document is indexed from the translog as part of the recovery and this
then prevents the shard from getting started.

These problems can occur when new fields are introduced at index time and
also when numeric_detection is enabled in the mapping (which makes these
errors more likely). Is this the case in your ES setup?

Can you also check the mappings of the logstash-2014.11.27 index and see
what fields can possible contain 'finished'? Unfortunately the field name
didn't get included with your errors.

On 27 November 2014 at 11:19, Jilles van Gurp jilles...@gmail.com
wrote:

Our production cluster went yellow last night after our logstash index
rolled over to the next version. I've seen this happen before but this time
I decided to properly diagnose and seek some feedback on what might be
going on.

So, I'd love some feedback on what is going on. I'm happy to keep this
cluster in a yellow state for a limited time to get some help from people
in this group trying to diagnose this properly and maybe help some others
who face the same issues. However, I will need to fix this one way or
another before end of business day today. I plan to perform a rolling
restart to see if node reinitialization fixes things. If not, I'll remove
the problematic logstash index and move on. I'd love suggesttions for less
intrusive solutions. I don't like losing data and rolling restarts are kind
of tedious to babysit. Tends to take 45 minutes or so.

Below is some information I've gathered. Let me know if you need me to
extract more data.

First the obvious:

{
"status" : 200,
"name" : "192.168.1.13",
"cluster_name" : "linko_elasticsearch",
"version" : {
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

[linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 221,
"active_shards" : 619,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So we're yellow and the reason is initializing and unassigned shards. We
have five nodes, of which three are data nodes. It seems we are hitting
some kind of resilience issue. The three machines have plenty of diskspace
and memory.

I found this in the log of one of our es nodes:
[2014-11-27 10:15:12,585][WARN ][cluster.action.shard ]
[192.168.1.13] [logstash-2014.11.27][4] sending failed shard for
[logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R],
s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to
start shard, message [RecoveryFailedException[[logstash-2014.11.27][4]:
Recovery failed from [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io
][inet[/192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][
es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested:
RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:
9300]][internal:index/shard/recovery/start_recovery]]; nested:
RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
failed]; nested: RemoteTransportException[[192.168.1.13][inet[/
192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
nested: NumberFormatException[For input string: "finished"]; ]]

on the mentioned node there's a corresponding messages:
[2014-11-27 10:17:54,187][WARN ][cluster.action.shard ]
[192.168.1.14] [logstash-2014.11.27][4] sending failed shard for
[logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R],
s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to
perform [indices:data/write/bulk[s]] on replica, message
[RemoteTransportException[[192.168.1.13][inet[/192.168.1.
13:9300]][indices:data/write/bulk[s][r]]]; nested:
NumberFormatException[For input string: "finished"]; ]]

All three data nodes have similar messages happening over and over again.

Our cluster has been up for a couple of weeks and seems pretty happy
otherwise. I deleted some older logstash indices a few days ago. The
cluster has logstash data and a few smallish indiceses we use for our
inbot.io service. The issue appears to be related to the logstash index
rollover. Our app servers and kibana talk to the two non data nodes that we
run on both our application servers.

My next stop was kibana which we use on the same cluster with the
logstash index that is probably causing us issues. Looking at that, I
noticed a few interesting things:

  • logstash indexing seems to be fine (good) and it appears there has
    been no data loss yet
  • our cpu load jumped around midnight and sort of stayed up on all
    three nodes. We measure this using collectd and both mean and max load
    jumped to around 1 around the time the index rollover happened.

My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'

All the indices listed there looked fine. I'll spare you the output but
everything appared to be in the 'done' stage.

Finally, I did
[linko@es3 elasticsearch]$ curl -XGET 'localhost:9200/_cluster/
health/logstash-2014.11.27/?pretty'
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5,
"active_shards" : 12,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So, that confirms last night's new logstash index is the issue and it
seems to have issues initializing two of its shards. With some
Googling, I also found a recently closed issues (e.g.
https://github.com/elasticsearch/elasticsearch/issues/6808) that seem
related to this for 1.3.x. However, I'm on 1.4 where this is supposedly
fixed. The NumberFormatException mentioned in the logs seems suspicious for
sure.

My questions are: what is going on, how do I fix it properly and without
data loss or downtime, and I do I stop this from happening? A rolling
restart might fix things (seen that happen before).

Maybe there are some magic curl incantation than nudges things along
that I've overlooked? I've dug around on google but all I seem to find is
advice along the lines of deleting indices and/or maybe trying this or that
by poking around with various apis in the elasticsearch internals. I'm
looking for something a bit more responsible/structured and am a bit wary
of the 'shotgun' approach here. Yes, I know I can fix things by deleting
stuff or restarting stuff and doing a hail mary. However, my main concern
is that the need for this seems to be a regularly recurring thing (done it
a dozen times over the course of the last year for different issues). What
I really need is simple instructions that I can give to our people to fix
things when things go bad: when x happens do this and you'll be fine. The
last thing I want is ops people who get in the habit of deleting indices or
doing random rolling restarts out of magical thinking. That sounds
like something that could cause more issues than that it solves. I know
from experience a botched rolling restart is not nice to deal with.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/
msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%
40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/392715d0-624e-4d12-a914-52d3bec7a3dd%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/cafa7488-72c2-4d95-b8b5-996c04757a41%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/cafa7488-72c2-4d95-b8b5-996c04757a41%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CA%2BA76TxDpUn4-TaGXepdNqjprmgPSg7%2BtddaqsncmyaMP4-WYQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

BTW. I should mention that I also filed a bug for this earlier today.
https://github.com/elasticsearch/elasticsearch/issues/8684

Clinton Gormley kindly replied to that and provided some additional insight.

It indeed seems our mapping is part of the problem but there's also the es
side of things where it shouldn't get in this state. Apparently a fix for
that part is coming.

Best,

Jilles

On Thursday, November 27, 2014 11:19:20 AM UTC+1, Jilles van Gurp wrote:

Our production cluster went yellow last night after our logstash index
rolled over to the next version. I've seen this happen before but this time
I decided to properly diagnose and seek some feedback on what might be
going on.

So, I'd love some feedback on what is going on. I'm happy to keep this
cluster in a yellow state for a limited time to get some help from people
in this group trying to diagnose this properly and maybe help some others
who face the same issues. However, I will need to fix this one way or
another before end of business day today. I plan to perform a rolling
restart to see if node reinitialization fixes things. If not, I'll remove
the problematic logstash index and move on. I'd love suggesttions for less
intrusive solutions. I don't like losing data and rolling restarts are kind
of tedious to babysit. Tends to take 45 minutes or so.

Below is some information I've gathered. Let me know if you need me to
extract more data.

First the obvious:

{
"status" : 200,
"name" : "192.168.1.13",
"cluster_name" : "linko_elasticsearch",
"version" : {
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

[linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 221,
"active_shards" : 619,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So we're yellow and the reason is initializing and unassigned shards. We
have five nodes, of which three are data nodes. It seems we are hitting
some kind of resilience issue. The three machines have plenty of diskspace
and memory.

I found this in the log of one of our es nodes:
[2014-11-27 10:15:12,585][WARN ][cluster.action.shard ] [192.168.1.13]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message
[RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from
[192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/
192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][
es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested:
RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
failed]; nested:
RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
nested: NumberFormatException[For input string: "finished"]; ]]

on the mentioned node there's a corresponding messages:
[2014-11-27 10:17:54,187][WARN ][cluster.action.shard ] [192.168.1.14]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform
[indices:data/write/bulk[s]] on replica, message
[RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
nested: NumberFormatException[For input string: "finished"]; ]]

All three data nodes have similar messages happening over and over again.

Our cluster has been up for a couple of weeks and seems pretty happy
otherwise. I deleted some older logstash indices a few days ago. The
cluster has logstash data and a few smallish indiceses we use for our
inbot.io service. The issue appears to be related to the logstash index
rollover. Our app servers and kibana talk to the two non data nodes that we
run on both our application servers.

My next stop was kibana which we use on the same cluster with the logstash
index that is probably causing us issues. Looking at that, I noticed a few
interesting things:

  • logstash indexing seems to be fine (good) and it appears there has
    been no data loss yet
  • our cpu load jumped around midnight and sort of stayed up on all
    three nodes. We measure this using collectd and both mean and max load
    jumped to around 1 around the time the index rollover happened.

My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'

All the indices listed there looked fine. I'll spare you the output but
everything appared to be in the 'done' stage.

Finally, I did
[linko@es3 elasticsearch]$ curl -XGET
'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
"cluster_name" : "linko_elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5,
"active_shards" : 12,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 1
}

So, that confirms last night's new logstash index is the issue and it
seems to have issues initializing two of its shards. With some Googling,
I also found a recently closed issues (e.g.
https://github.com/elasticsearch/elasticsearch/issues/6808) that seem
related to this for 1.3.x. However, I'm on 1.4 where this is supposedly
fixed. The NumberFormatException mentioned in the logs seems suspicious for
sure.

My questions are: what is going on, how do I fix it properly and without
data loss or downtime, and I do I stop this from happening? A rolling
restart might fix things (seen that happen before).

Maybe there are some magic curl incantation than nudges things along that
I've overlooked? I've dug around on google but all I seem to find is advice
along the lines of deleting indices and/or maybe trying this or that by
poking around with various apis in the elasticsearch internals. I'm
looking for something a bit more responsible/structured and am a bit wary
of the 'shotgun' approach here. Yes, I know I can fix things by deleting
stuff or restarting stuff and doing a hail mary. However, my main concern
is that the need for this seems to be a regularly recurring thing (done it
a dozen times over the course of the last year for different issues). What
I really need is simple instructions that I can give to our people to fix
things when things go bad: when x happens do this and you'll be fine. The
last thing I want is ops people who get in the habit of deleting indices or
doing random rolling restarts out of magical thinking. That sounds like
something that could cause more issues than that it solves. I know from
experience a botched rolling restart is not nice to deal with.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d4d824d0-387f-410b-b4ff-fb49cbd18be0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.