Weird cluster failure (unresponsive node + unresponsive load balancer)

Hi all,

We have recently experienced a weird failure whereby one node became
unresponsive (maxed-out inode table?), but then, even more worryingly, the
'non data' load balancer node didn't route the traffic to the remaining
healthy node. Instead, it became unresponsive itself.

As soon as we killed the unresponsive data node, the load balancer started
responding and using the healthy node.

If anyone has an idea of what this could have been, I would greatly
appreciate your input!

Details below...

Cheers, James.

====

Architecture:

2 ElasticSearch data nodes behind an ElasticSearch 'non data' node, acting
as a load balancer.
Both nodes contain all shards.
ElasticSearch version 20.2 on all nodes.
Linux / ext4 filesystem.
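
For reference, the 'non data' node is a normal ElasticSearch process
started with roughly the following in elasticsearch.yml (the cluster name
is a placeholder, and disabling master eligibility as well is optional):

    # elasticsearch.yml on the 'non data' load balancer node (sketch)
    cluster.name: our-cluster   # placeholder
    node.data: false            # hold no shards, just route requests
    node.master: false          # optionally keep it out of master election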

====

Problem 1: ElasticSearch data node #1 became unresponsive to queries for 30
mins, and we eventually killed the process.

We believe that over roughly 3 hours it used up and finally maxed out its
inode table (a small script for watching this is sketched after the lists
below).

Before and during unresponsive period:

  • Disk io: high
  • Disk usage: elasticsearch directory gradually filling (got to 90%),
    rather than staying stable
  • Number of threads: climbing quickly
  • Cpu usage: high
  • Load average: high
  • Inode table usage: climbing in an arc and then maxing out, I believe
  • Memory usage: normal

Once process was killed:

  • Disk io: almost 0
  • Disk usage: remaining high, but stable
  • Number of threads: low and stable
  • Cpu usage: almost 0
  • Load average: almost 0
  • Inode table usage: flat near maximum
  • Memory usage: normal

After process was restarted:

  • Disk io: normal
  • Disk usage: cleaned up files and back to normal amount
  • Number of threads: low and stable
  • Cpu usage: normal
  • Load average: normal
  • Inode table usage: cleaned up; low and stable
  • Memory usage: normal
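
For what it's worth, inode usage on the data node can be watched over time
with a small script like the sketch below (the data path is a placeholder;
statvfs exposes the filesystem's inode counts on Linux):

    # inode_watch.py - minimal sketch; the data path is an assumption
    import os
    import time

    DATA_PATH = "/var/lib/elasticsearch"  # placeholder: ES data mount point

    while True:
        st = os.statvfs(DATA_PATH)
        total, free = st.f_files, st.f_ffree   # total / free inodes
        used = total - free
        pct = 100.0 * used / total if total else 0.0
        print("inodes used: %d / %d (%.1f%%)" % (used, total, pct))
        time.sleep(60)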

====

Problem 2: While ElasticSearch data node #1 was unresponsive, the 'non
data' load balancer node did not route the traffic to the healthy data node
#2. It did not even answer the ~50% of queries that should have gone to the
healthy node; instead, it became fully unresponsive itself.

At this point, the load balancer and node #1 were fully unresponsive, but
node #2 was fully responsive.

Once data node #1 was killed, the load balancer became responsive again and
successfully routed its queries to data node #2!
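
For anyone wanting to reproduce this kind of check: probing each node's
HTTP port with a short timeout is enough to see who is answering. Something
like the sketch below (hostnames are placeholders, default port 9200
assumed):

    # probe.py - quick responsiveness check; hostnames are placeholders
    import urllib.request

    NODES = {
        "load balancer": "http://lb-node:9200/",
        "data node #1":  "http://data-node-1:9200/",
        "data node #2":  "http://data-node-2:9200/",
    }

    for name, url in NODES.items():
        try:
            urllib.request.urlopen(url, timeout=5)
            print("%-14s responsive" % name)
        except OSError as e:  # URLError and socket timeouts are OSErrors
            print("%-14s NOT responsive (%s)" % (name, e))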


Just giving this a bump in case someone sees it and has some thoughts!

I think it looks like a pretty interesting case (particularly the load
balancer failing to balance to the healthy node), but I have no clue what
could be causing it...


Hey James,

about #2: when the node became unresponsive, did you see any cluster
leave/join operations during that time? It is strange that, even though the
node was unresponsive, it sounds as if it was still part of the cluster.
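
One quick way to check from the healthy node (the host below is a
placeholder) is the cluster health API - if the stuck node never actually
left the cluster, number_of_nodes will still include it:

    # ask the healthy data node for its view of cluster membership
    import json
    import urllib.request

    url = "http://data-node-2:9200/_cluster/health"  # placeholder host
    health = json.load(urllib.request.urlopen(url, timeout=5))
    print(health["status"],
          "nodes:", health["number_of_nodes"],
          "data nodes:", health["number_of_data_nodes"])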

about #1: You could call some of the stats APIs and the hot_threads API to
get a clue about what is causing the CPU usage. Regarding the open file
handles/inodes, you should use lsof to check what kind of files are open
and/or created and held open (there must be lots of files being created and
deleted that still have an open file handle - this is why the usage goes
away after the process is killed, as the inodes are only released then). Do
you have any special configuration for your indices or elasticsearch.yml
that differs significantly from the default configuration?
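
Roughly, a sketch of those checks (host and PID are placeholders; the exact
layout of the stats JSON differs between versions):

    # diagnose.py - sketch of the checks above; host and PID are placeholders
    import json
    import subprocess
    import urllib.request

    ES = "http://data-node-1:9200"

    # hot threads: plain-text report of what the busiest threads are doing
    print(urllib.request.urlopen(ES + "/_nodes/hot_threads", timeout=10)
          .read().decode())

    # node stats: JSON including process-level open file descriptor counts
    stats = json.load(urllib.request.urlopen(ES + "/_nodes/stats", timeout=10))
    for node in stats.get("nodes", {}).values():
        fds = node.get("process", {}).get("open_file_descriptors")
        print(node.get("name"), "open file descriptors:", fds)

    # which files does the ES process hold open? deleted-but-still-open
    # files show up with "(deleted)" in lsof's NAME column
    pid = "12345"  # placeholder: the elasticsearch java process id
    out = subprocess.check_output(["lsof", "-p", pid]).decode()
    deleted = [line for line in out.splitlines() if "(deleted)" in line]
    print("open-but-deleted files:", len(deleted))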

--Alex
