Hi,
I'm working on a small 7.15 cluster (8 data nodes named data-node-X), and for some time I've been seeing abnormal behaviour whose cause I can't find.
Every new rollover of our major indexes (ILM-managed) is allocated to data-node-4, which therefore receives almost all of the cluster's indexing traffic from my 8 inserters (Logstash instances running in containers).
The cluster became unable to cope, as 7 nodes were doing almost nothing while one carried the whole load.
I decided to add allocation routing to my index templates, which made the cluster stable again, but it is clearly duct tape: the rollovers go back to data-node-4 as soon as I remove the routing.
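For readers unfamiliar with this kind of workaround, a template-level allocation filter could look roughly like the sketch below. It is only an illustration: the index pattern and the node list are hypothetical, and it assumes the `index.routing.allocation.include._name` setting, whereas the actual templates may use a different filter.

```python
import json


def legacy_template_body(index_pattern, node_names):
    """Build a legacy index template body that limits allocation to the given nodes."""
    return {
        "index_patterns": [index_pattern],
        "settings": {
            # Assumption: nodes are filtered by name with the "include" filter.
            "index.routing.allocation.include._name": ",".join(node_names),
        },
    }


# Hypothetical pattern and node selection, for illustration only.
body = legacy_template_body(
    "logs-example-*",
    ["data-node-1", "data-node-3", "data-node-5", "data-node-7"],
)
print(json.dumps(body, indent=2))
```

A body like this would be stored with the legacy template API (`PUT _template/<name>`), so every index created from the template inherits the filter.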
My question is: how can I debug this?
Some architecture information:
- Each data node is paired with another on a physical host: nodes 1 and 2 are on the same host, 3 and 4 on another host, and so on.
- Each node has 10 dedicated 1.8 TB disks, exposed through multiple data paths (MDP), i.e. data directories on which the disks are mounted.
- Indexes are ILM-managed, and the templates (legacy templates, by the way) currently contain routing allocation directives to keep all new rollovers from landing on the same node.
- All ILM policies have a single hot phase, after which the data is deleted (after 16 to 30 days, depending on the policy).
- Although each routing allocation directive lists 4 nodes, every new rollover still lands on the same single node (no longer data-node-4 for the templates that do not include it).
- The timing correlates with our upgrade to 7.15, but I can't be sure the two are really connected.
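To put numbers on the imbalance described above, one option is to count started primary shards per node from the output of `GET _cat/shards?h=index,shard,prirep,state,node`. The sketch below is a minimal illustration under that assumption; the sample text and index names are made up, not real output from this cluster.

```python
from collections import Counter

# Hypothetical sample of `_cat/shards?h=index,shard,prirep,state,node` output.
SAMPLE = """\
logs-a-000012 0 p STARTED data-node-4
logs-a-000012 0 r STARTED data-node-7
logs-b-000031 0 p STARTED data-node-4
logs-a-000011 0 p STARTED data-node-2
"""


def primaries_per_node(cat_shards_text):
    """Count started primary shards per node from _cat/shards text output."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            # Unassigned shards have no node column; skip them.
            continue
        _index, _shard, prirep, state, node = fields[:5]
        if prirep == "p" and state == "STARTED":
            counts[node] += 1
    return counts


counts = primaries_per_node(SAMPLE)
for node, total in counts.most_common():
    print(node, total)
```

Running this against only the newest write indexes would show whether they all sit on one node, and the cluster allocation explain API can then be queried for one of those shards to see why the other nodes were not chosen.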
Thanks in advance for your hints or questions!