Unusual logging

Dan_Fairs · September 19, 2013, 11:29am

Hi,

We had some problems yesterday with our (ageing) ES cluster: 9 nodes running 0.19.8 on Ubuntu 12.04 LTS, using Oracle Java (1.7.0_25-b15). We have around a hundred indices, each with 5 shards and 2 replicas.
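
For a rough sense of scale, here is a small sketch of the shard arithmetic. It assumes exactly 100 indices (the real figure is only "around a hundred") and an even spread across the 9 nodes, which may not hold in practice; the function name is just for illustration.

```python
def total_shard_copies(indices, shards_per_index, replicas):
    """Every primary shard has `replicas` extra copies, so count 1 + replicas per shard."""
    return indices * shards_per_index * (1 + replicas)


INDICES = 100          # assumption: "around a hundred" indices
SHARDS_PER_INDEX = 5
REPLICAS = 2
NODES = 9

total = total_shard_copies(INDICES, SHARDS_PER_INDEX, REPLICAS)
per_node = total / NODES  # assumes an even spread across nodes

print(f"shard copies in the cluster: {total}")
print(f"approximate shard copies per node: {per_node:.0f}")
```

Under those assumptions, a single dump of the full routing table is on the order of 1,500 lines.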

I've put the logs from our nodes around the time it happened here:

https://gist.github.com/danfairs/3cd2b5133eb92f4a60f2

One server (mc-3, Paibo), which was the master at the time, has no logs available for that period: it was logging so quickly that our log rotation had already discarded the relevant files. A few hours later, in the logs we do have, it appears to be logging shard state (which I've never seen logged before):

--------[production_tweets_for_2011_50_alt][1], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2011_50_alt][2], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2011_50_alt][3], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2011_50_alt][4], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2012_23][0], node[1MSmEqUtR0WPqk3fR2rwaA], [P], s[STARTED]
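
In case it helps anyone summarise a dump like this, here is a minimal Python sketch that parses lines of the shape shown above and tallies them by role and state. The regular expression is inferred only from these five sample lines, so it is an assumption about the format; real dumps may contain other fields (relocating shards, unassigned shards) that it does not handle.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines only; not an official format.
LINE_RE = re.compile(
    r"^-*\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\], "
    r"node\[(?P<node>[^\]]*)\], "
    r"\[(?P<role>[PR])\], "
    r"s\[(?P<state>[A-Z_]+)\]"
)


def parse_line(line):
    """Return a dict for one shard-state line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["shard"] = int(entry["shard"])
    return entry


SAMPLE = """\
--------[production_tweets_for_2011_50_alt][1], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2011_50_alt][2], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2011_50_alt][3], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2011_50_alt][4], node[1MSmEqUtR0WPqk3fR2rwaA], [R], s[STARTED]
--------[production_tweets_for_2012_23][0], node[1MSmEqUtR0WPqk3fR2rwaA], [P], s[STARTED]
"""

# Skip anything that does not look like a shard-state line.
entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e]

# Tally by (role, state): P = primary, R = replica.
counts = Counter((e["role"], e["state"]) for e in entries)
for (role, state), n in sorted(counts.items()):
    print(role, state, n)
```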

Here's a brief summary of what appeared to happen in sequence on each node. At the start of all this, mc-3 (Paibo) was master.

mc-1 (Blind Justice) never saw a change in cluster state, just saw missing indices.
mc-2 (Hurricane) had many failed master pings, logged master_left for Paibo, then new master Zaladane based on timeouts. Zaladane then left, and Blind Justice was elected. Blind Justice then left, and Hurricane (i.e. itself) was elected. It then logged not enough master nodes.
mc-3 (Paibo) has no logs available from the time of the incident; the later logs contain lots of cluster/shard state.
mc-4 (Harald Jaekelsson) new master Hurricane, previous Paibo, from Hurricane. Then Hurricane left, so unable to form cluster. Current nodes: Harald Jaekelsson, Outrage, Living Hulk, Gibbon, Pip the Troll.
mc-5 (Zaladane) never saw a change in cluster state, just saw missing indices.
mc-6 (Gibbon) new master Hurricane, previous Paibo, from Hurricane. Then Hurricane left, so unable to form cluster. Current nodes: Harald Jaekelsson, Outrage, Living Hulk, Gibbon, Pip the Troll.
mc-7 (Living Hulk) many failed master (Paibo) pings, logged new master Hurricane. Then logged Hurricane master_left.
mc-8 (Outrage) new master Hurricane, previous Paibo, from Hurricane. Then Hurricane left, so unable to form cluster. Current nodes: Harald Jaekelsson, Outrage, Living Hulk, Gibbon, Pip the Troll.
server-5 (Pip the Troll) new master Hurricane, previous Paibo. Then Hurricane left, so unable to form cluster.

Around that time, disk I/O on mc-1, mc-3 and mc-5 dropped off to near zero, while it increased on mc-2, mc-4, mc-6, mc-7, mc-8 and server-5.

I'm not concerned about machines not being able to form clusters - it's become apparent that a couple of nodes (I'm not sure which, now) hadn't been restarted after the last config change, and so their minimum_master_nodes was set incorrectly. This also explains the differing heap sizes between the machines (some have 16GB, some have 20GB). The nodes each have 32GB RAM. There are no other obvious bottlenecks here (network, disk space, CPU, etc.).
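
For reference, the usual guidance for discovery.zen.minimum_master_nodes is a strict majority of the master-eligible nodes. A minimal sketch of that arithmetic, assuming all 9 nodes are master-eligible (the function names and the group sizes are purely illustrative):

```python
def quorum(master_eligible_nodes):
    """Strict majority: more than half of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def can_elect(visible_master_eligible, minimum_master_nodes):
    """A group of nodes may elect a master only if it can see at least this many."""
    return visible_master_eligible >= minimum_master_nodes


MASTER_ELIGIBLE = 9  # assumption: every node in the cluster is master-eligible
required = quorum(MASTER_ELIGIBLE)
print(f"minimum_master_nodes for {MASTER_ELIGIBLE} nodes: {required}")

# Illustrative group sizes only: how a partition fares under the majority rule.
for group_size in (3, 6):
    print(group_size, "visible nodes -> can elect:", can_elect(group_size, required))
```

A node still running with a stale, lower value could consider a smaller group sufficient, which is why every node needs to pick up the same setting.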

I'm primarily interested in three nodes: mc-1 (Blind Justice), mc-3 (Paibo), and mc-5 (Zaladane). What kind of thing would cause a master to start spewing logs like the above? And why might mc-5 (Zaladane) apparently not see any of the cluster state changes that were happening around it, when one node (mc-2, Hurricane) even briefly thought Zaladane was master? The same goes for mc-1.

Upgrading our ES version is on the roadmap too.

I don't expect to get to a root cause this time, but knowing what kind of thing might have caused mc-3, the original master, to start spewing cluster state into its logs would be extremely helpful.

Thanks,
Dan

--
Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com
