Indexing heavy lifting not well distributed

rodaj · January 5, 2018, 8:54pm

Environment:
Rancher 1.6.10, Docker 1.12.*, Elasticsearch 5.5.1 (without X-Pack).
15 ES data nodes across 5 Rancher nodes (each Rancher node is a bare-metal server).
3 ES data nodes per Rancher node, deployed as 3 Rancher "services" of 5 data nodes each. Each ES data node stores its data on a dedicated local filesystem on its Rancher node.
3 master nodes, currently on Rancher nodes 1, 4, and 5.
Each Rancher node is set up as a separate "zone" in ES so that the ES cluster stays resilient if a single Rancher node crashes.
A Rancher LB proxy on each Rancher node points to 1 of 2 nginx web servers, currently on Rancher nodes 1 and 2.
The nginx web servers point to the 3 ES master nodes (load has been such that ES client nodes haven't appeared to be necessary... yet).
A user has submitted 6 indexing jobs against the Rancher LB proxy on Rancher node 2.

Here is the part that makes no sense. The 3 data nodes on Rancher node 2 appear to be doing all of the CPU-intensive indexing, rather than the work being distributed across all the data nodes, as one might expect the ES master nodes to do after receiving the bulk indexing requests from one of the two nginx web servers. The "zone" identities happen to be the same as the bare-metal servers' short names. It seems awfully coincidental that the host the user targeted to reach the cluster is also the Rancher node whose ES data nodes are doing all the CPU-intensive indexing, as if the work were somehow restricted to that "zone". Note that I have verified via _cat/shards that the 6 indexes actively being processed have their shards "properly" spread across the 15 ES data nodes on all 5 Rancher nodes.
Why is all the CPU being consumed on a single Rancher node, restricted to 3 ES data nodes? This is a disconcerting condition that could overload system resources. I have not found any documentation indicating that this condition is even possible with the given configuration, i.e. we are NOT restricting indexed data to a SINGLE zone.
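
For anyone wanting to repeat the _cat/shards check mentioned above, here is a minimal Python sketch that tallies started primaries and replicas per node. It assumes the text came from `GET _cat/shards?h=index,shard,prirep,state,node`; the index and node names in the sample are hypothetical, not taken from this cluster. It only shows how shards are spread, and does not by itself explain where the CPU goes.

```python
from collections import Counter

# Hypothetical sample in the shape of
# `GET _cat/shards?h=index,shard,prirep,state,node`.
# Index and node names are made up for illustration.
SAMPLE = """\
jobs-1 0 p STARTED es-data-r1-a
jobs-1 0 r STARTED es-data-r2-a
jobs-1 1 p STARTED es-data-r2-b
jobs-1 1 r STARTED es-data-r3-a
jobs-2 0 p STARTED es-data-r2-c
jobs-2 0 r STARTED es-data-r4-a
"""


def tally_shards(cat_output):
    """Count started primary and replica shards per node."""
    primaries = Counter()
    replicas = Counter()
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # blank line, or a shard with no node column (unassigned)
        _index, _shard, prirep, state, node = fields[:5]
        if state != "STARTED":
            continue  # ignore relocating/initializing shards in this sketch
        if prirep == "p":
            primaries[node] += 1
        else:
            replicas[node] += 1
    return primaries, replicas


if __name__ == "__main__":
    primaries, replicas = tally_shards(SAMPLE)
    for node in sorted(set(primaries) | set(replicas)):
        print(node, "primaries:", primaries[node], "replicas:", replicas[node])
```

If the per-node counts come out even for the busy indexes, shard placement is unlikely to be the whole story and the skew has to come from somewhere else.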

rodaj · January 8, 2018, 9:17pm

A subsequent test with 2 indexing jobs coming in on 4 of the 5 Rancher nodes results in the same condition: Rancher node 2, which hosts 3 ES data nodes, is doing all the CPU work. Therefore, it appears that the match with the target host in the prior observation was coincidental and is not a factor in whatever ES does to decide where indexing work is directed.

rodaj · January 8, 2018, 10:08pm

OK, I re-read the Forced Awareness section of https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html a few times. Based on the example discussed in the documentation, I'm starting to believe that having 5 zones (1 per Rancher node) is hurting indexing capacity. Can someone from Elastic.co confirm whether, when using "Forced Awareness", indexing logic is restricted to a single zone? I.e., does Elasticsearch lack any logic to "coordinate" proper zone awareness for shard placement between multiple zones during indexing?
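
For reference, the settings that section describes can be sketched as a cluster-settings body. This is only an illustration in Python: the attribute name `zone` matches the setup described earlier, but the zone values `rancher1` to `rancher5` are hypothetical placeholders, and the helper `awareness_settings` is made up for this example. It shows what forced awareness configures; it does not answer whether indexing work is tied to a zone.

```python
import json


def awareness_settings(attribute, zones, forced=True):
    """Build a body for `PUT _cluster/settings` (hypothetical helper).

    With forced=False only plain allocation awareness is configured;
    with forced=True the full list of expected zone values is declared too.
    """
    settings = {"cluster.routing.allocation.awareness.attributes": attribute}
    if forced:
        key = "cluster.routing.allocation.awareness.force.%s.values" % attribute
        settings[key] = ",".join(zones)
    return {"persistent": settings}


# Placeholder zone names; the real ones are the bare-metal short names.
ZONES = ["rancher1", "rancher2", "rancher3", "rancher4", "rancher5"]

if __name__ == "__main__":
    print(json.dumps(awareness_settings("zone", ZONES), indent=2))
```

Each data node would also need its own zone declared in elasticsearch.yml (in 5.x, `node.attr.zone: <name>`) for these settings to have any effect.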

system · February 5, 2018, 10:09pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.