Last night we had elasticsearch screw up big time in production. Let me
describe the issue:
Via /_plugin/head, the cluster overview was completely empty - the node wasn't
even showing itself in the list of nodes.
No indexes appeared in the overview either; however, the data browser WAS
showing that the indexes existed.
In the cluster status, all shards were listed on the node itself, but they
were marked as 'unassigned'.
Restarting the service on one of the nodes did not help - instead, the logs
showed it trying to connect to itself as master under its previous node
name rather than its new one, and the attempt then failed with an error saying
the node wasn't a master. The two nodes were not talking to each other and
were both doing exactly the same thing: trying to elect themselves as master
and failing. The head plugin was available and loading, but the cluster state
was showing as 'not connected'.
It wasn't until we shut down the services on both nodes, added two fresh
nodes in sequence (giving the first one time to elect itself as master before
adding the second and letting the two talk), and then brought the original
two nodes back up, that the cluster started showing all the nodes instead of
a completely empty cluster state.
We are running version 0.90.2 on EC2 instances with the EC2 security groups
plugin and no other plugins installed - Sematext is installed but runs in
standalone mode, talking to the JVM directly rather than being loaded as a
plugin. We're on OpenJDK 7.
Has anyone else encountered this issue where the clustering layer seems to
fail completely? Is this a known bug? I can't find any reference to this
issue on the mailing list yet.
I'd attach cluster output, but since it was production and sites were down,
in our haste to fix it we didn't dump any health statuses. It's also worth
mentioning that the logs were completely devoid of information from the point
the nodes stopped serving the indexes - the previous log entry was from about
24 hours earlier and concerned a failed search.
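For future incidents, it can help to snapshot the cluster's view of itself before restarting anything. A hedged sketch using the standard HTTP endpoints available in 0.90.x (the host/port is an assumption - point it at one of your nodes):

```shell
# Hypothetical node address; replace with one of your own.
ES=http://localhost:9200

# Capture health, full cluster state, and node info before touching anything.
curl -s "$ES/_cluster/health?pretty" > health.json
curl -s "$ES/_cluster/state?pretty"  > state.json
curl -s "$ES/_nodes?pretty"          > nodes.json
```

Even if the cluster is wedged, these dumps make a later RCA much easier than reconstructing from memory.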
Thanks to anyone who can help or lead us in the right direction of what
happened. We're also about to start digging through to do an RCA.
My first thought is that you encountered a split-brain scenario, but having
head show a completely empty cluster is strange. With split brain, you
will at least see part of the data.
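If split brain is the suspect, the usual mitigation on 0.90.x is to require a quorum of master-eligible nodes before an election can succeed. A sketch of the relevant elasticsearch.yml setting (the value depends on your topology and is an assumption here):

```yaml
# elasticsearch.yml - hedged example, not a drop-in fix.
# With N master-eligible nodes, set this to (N / 2) + 1.
discovery.zen.minimum_master_nodes: 2
```

Note that with only two nodes there is no clean quorum: setting this to 2 prevents split brain but means no master can be elected while either node is down, which is why a third (possibly data-less) master-eligible node is often added.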
Have you tried to debug with the es2unix utility (GitHub: elastic/es2unix -
command-line ES)? Use the lifecycle command to understand what the master
status was during the failure.
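For reference, es2unix is a small command-line wrapper around the ES HTTP API. The invocation below is a sketch - the download URL, `-u` flag, and `lifecycle` command are from the project's README as I recall them, and the node URL is an assumption:

```shell
# Fetch the es2unix binary (per the project README) and make it executable.
curl -s download.elasticsearch.org/es2unix/es > es && chmod +x es

# Watch master/cluster lifecycle events against one of your nodes.
./es -u http://10.0.0.1:9200 lifecycle
```

Running this while the cluster is misbehaving should show whether a master was ever elected and when it was lost.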
I personally do not care for the OpenJDK JVM - too many issues in the past,
though it has been a while since I last used it.