Hi all,
We recently experienced a strange failure whereby one data node became
unresponsive (maxed-out inode table?), but then, even more worryingly, the
'non data' load balancer node didn't route traffic off to the remaining
healthy node. Instead it became unresponsive itself.
As soon as we killed the unresponsive data node, the load balancer started
responding again and used the healthy node.
If anyone has an idea of what this could have been, I would greatly
appreciate your input!
Details below...
Cheers, James.
====
Architecture:
2 ElasticSearch data nodes behind an ElasticSearch 'non data' node, acting
as a load balancer.
Both data nodes contain all shards.
ElasticSearch version 0.20.2 on all nodes.
Linux / ext4 filesystem.
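For context, the 'non data' node is configured roughly like this (a minimal
elasticsearch.yml sketch, assuming the usual node.data / node.master
settings; not our exact config):

```yaml
# elasticsearch.yml on the 'non data' load balancer node
node.data: false     # holds no shards
node.master: false   # not master-eligible; purely routes requests
```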
====
Problem 1: ElasticSearch data node #1 became unresponsive to queries for 30
mins, and we eventually killed the process.
We believe that over 3 hours or so, it used up and finally maxed out its
inode table.
Before and during unresponsive period:
- Disk io: high
- Disk usage: elasticsearch directory gradually filling (got to 90%),
rather than staying stable
- Number of threads: climbing quickly
- Cpu usage: high
- Load average: high
- Inode table usage: climbing in an arc and then maxing out, I believe
- Memory usage: normal
Once process was killed:
- Disk io: almost 0
- Disk usage: remaining high, but stable
- Number of threads: low and stable
- Cpu usage: almost 0
- Load average: almost 0
- Inode table usage: flat near maximum
- Memory usage: normal
After process was restarted:
- Disk io: normal
- Disk usage: cleaned up files and back to normal amount
- Number of threads: low and stable
- Cpu usage: normal
- Load average: normal
- Inode table usage: cleaned up; low and stable
- Memory usage: normal
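For anyone wanting to watch for the same symptom: the inode table usage
described above can be monitored with df -i. A minimal sketch (the data
directory path and the 90% threshold are assumptions, not what we ran):

```shell
#!/bin/sh
# Filesystem to watch -- the one holding the ES data dir (assumption: /)
DATA_DIR=${DATA_DIR:-/}

# -i reports inodes instead of blocks; -P keeps each filesystem on one line
df -iP "$DATA_DIR"

# Column 5 is IUse%; strip the trailing '%' to compare numerically
USED=$(df -iP "$DATA_DIR" | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$USED" -ge 90 ]; then
    echo "WARNING: inode usage at ${USED}% on $DATA_DIR"
fi
```

Graphing that number over time should show the same "climbing arc" we saw
before the node locked up.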
====
Problem 2: Whilst ElasticSearch data node #1 was unresponsive, the 'non
data' load balancer node did not route traffic off to the healthy data
node #2. It did not respond to even 50% of queries; instead, it became
fully unresponsive itself.
At this point, the load balancer and node #1 were fully unresponsive, but
node #2 was fully responsive.
Once data node #1 was killed, the load balancer became responsive again and
successfully routed its queries to data node #2!
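If this recurs, it may help to capture what the load balancer thinks the
cluster looks like while the data node is hanging. A sketch using the
standard cluster health and nodes hot threads APIs (host/port are
assumptions; adjust to your setup):

```shell
#!/bin/sh
# Address of the 'non data' load balancer node (assumption)
LB=${LB:-localhost:9200}

# Cluster-wide state as seen from the load balancer
health_url="http://$LB/_cluster/health?pretty"
# Stack samples of the busiest threads -- shows what a hung node is doing
hot_threads_url="http://$LB/_nodes/hot_threads"

if command -v curl >/dev/null; then
    curl -s --max-time 5 "$health_url" || echo "no response from $LB"
    curl -s --max-time 5 "$hot_threads_url" || echo "no response from $LB"
fi
```

If the load balancer itself is wedged (as in our case), pointing the same
requests directly at each data node would at least confirm which nodes are
still serving.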
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.