Hi all,
We recently experienced a strange failure whereby one data node became
unresponsive (maxed-out inode table?), but then, even more worryingly, the
'non data' load balancer node didn't route traffic off to the remaining
healthy node. Instead it became unresponsive itself.
As soon as we killed the unresponsive data node, the load balancer started
responding again and used the healthy node.
If anyone has an idea of what this could have been, I would greatly
appreciate your input!
Details below...
Cheers, James.
====
Architecture:
2 ElasticSearch data nodes behind an ElasticSearch 'non data' node, acting
as a load balancer.
Both data nodes contain all shards.
ElasticSearch version 0.20.2 on all nodes.
Linux / ext4 filesystem.
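For context, the 'non data' node is configured roughly like this (a minimal
elasticsearch.yml sketch, assuming the usual node.data / node.master
settings; not our exact config):

```yaml
# elasticsearch.yml on the 'non data' load balancer node
node.data: false     # holds no shards
node.master: false   # not master-eligible; purely routes requests
```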
====
Problem 1: ElasticSearch data node #1 became unresponsive to queries for 30
mins, and we eventually killed the process.
We believe that over 3 hours or so, it used up and finally maxed out its
inode table.
Before and during unresponsive period:
- Disk io: high
- Disk usage: elasticsearch directory gradually filling (got to 90%),
rather than staying stable
- Number of threads: climbing quickly
- Cpu usage: high
- Load average: high
- Inode table usage: climbing in an arc and then maxing out, I believe
- Memory usage: normal
Once process was killed:
- Disk io: almost 0
- Disk usage: remaining high, but stable
- Number of threads: low and stable
- Cpu usage: almost 0
- Load average: almost 0
- Inode table usage: flat near maximum
- Memory usage: normal
After process was restarted:
- Disk io: normal
- Disk usage: cleaned up files and back to normal amount
- Number of threads: low and stable
- Cpu usage: normal
- Load average: normal
- Inode table usage: cleaned up; low and stable
- Memory usage: normal
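For anyone wanting to watch for the same symptom: the inode table usage
described above can be monitored with df -i. A minimal sketch (the data
directory path and the 90% threshold are assumptions, not what we ran):

```shell
#!/bin/sh
# Filesystem to watch -- the one holding the ES data dir (assumption: /)
DATA_DIR=${DATA_DIR:-/}

# -i reports inodes instead of blocks; -P keeps each filesystem on one line
df -iP "$DATA_DIR"

# Column 5 is IUse%; strip the trailing '%' to compare numerically
USED=$(df -iP "$DATA_DIR" | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$USED" -ge 90 ]; then
    echo "WARNING: inode usage at ${USED}% on $DATA_DIR"
fi
```

Graphing that number over time should show the same "climbing arc" we saw
before the node locked up.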
====
Problem 2: Whilst ElasticSearch data node #1 was unresponsive, the 'non
data' load balancer node did not route traffic off to the healthy data
node #2. It did not respond to even 50% of queries; instead, it became
fully unresponsive itself.
At this point, the load balancer and node #1 were fully unresponsive, but
node #2 was fully responsive.
Once data node #1 was killed, the load balancer became responsive again and
successfully routed its queries to data node #2!
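If this recurs, it may help to capture what the load balancer thinks the
cluster looks like while the data node is hanging. A sketch using the
standard cluster health and nodes hot threads APIs (host/port are
assumptions; adjust to your setup):

```shell
#!/bin/sh
# Address of the 'non data' load balancer node (assumption)
LB=${LB:-localhost:9200}

# Cluster-wide state as seen from the load balancer
health_url="http://$LB/_cluster/health?pretty"
# Stack samples of the busiest threads -- shows what a hung node is doing
hot_threads_url="http://$LB/_nodes/hot_threads"

if command -v curl >/dev/null; then
    curl -s --max-time 5 "$health_url" || echo "no response from $LB"
    curl -s --max-time 5 "$hot_threads_url" || echo "no response from $LB"
fi
```

If the load balancer itself is wedged (as in our case), pointing the same
requests directly at each data node would at least confirm which nodes are
still serving.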
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.