One shard getting more load


#1

Hi

We are seeing strange behaviour where all servers holding shard 1 are getting more load than the others.

We are running a cluster of 18 machines (on AWS). There is only one index, with 5 shards, in the cluster. No special document routing is used.

Any pointers on how to debug this would be greatly appreciated!
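
One possible first check is whether shard 1 simply holds more data than the other shards. The sketch below is only an illustration: it assumes the rows come from `GET /_cat/shards?format=json&bytes=b` on a version whose cat APIs support JSON output, and the function names `summarize_shards` and `primary_doc_skew` are made up for this example.

```python
from collections import defaultdict


def summarize_shards(rows):
    """Group _cat/shards rows by shard number.

    Returns {shard: [(node, prirep, docs, store_bytes), ...]}.
    """
    by_shard = defaultdict(list)
    for row in rows:
        # Unassigned or initializing copies carry no usable numbers.
        if row.get("state") != "STARTED":
            continue
        by_shard[int(row["shard"])].append(
            (row["node"], row["prirep"], int(row["docs"]), int(row["store"]))
        )
    return dict(by_shard)


def primary_doc_skew(rows):
    """Ratio of the largest to the smallest primary doc count (1.0 means even)."""
    docs = [
        int(row["docs"])
        for row in rows
        if row.get("state") == "STARTED" and row["prirep"] == "p"
    ]
    if not docs or min(docs) == 0:
        return float("inf")
    return max(docs) / min(docs)
```

A skew close to 1.0 would suggest the documents are spread evenly, in which case the extra load on shard 1 has to come from somewhere else, such as query cost or segment layout.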


(Anishek) #2

We are facing a similar issue: some nodes are much more heavily loaded than others, even though every node holds all the shards required to serve a query. We use ES mostly for percolation and have 3 primaries and 8 replicas on 9 machines.

Still, we see certain nodes at 80% usage and others at 5-10%. I am planning to use preference=_local to see if that helps, and will report back.


#3

Thanks for your answer. I don't think "preference=_local" will help us, as each node only has one shard.

Does anyone know how to check which node acted as the "coordinating node" for a specific search? We are using the Transport client, connecting through an ELB with all nodes behind it. The only thing I can think of is that some nodes act as "coordinating node" more often than others, and that this generates more load.


(Anishek) #4

Unless the ELB is sending more requests to the overloaded machines, coordination should not increase the load much on specific machines. Is there a way to get the ELB logs to show how requests are distributed across nodes?

I tried the preference=_local option, but it did not help us. Routing for us is round robin on the transport client, with the "client.transport.sniff" property set to true. We also enabled trace logging on the transport client for some time to check the distribution, and requests are spread round robin fairly evenly across our nodes. Still, certain nodes are always heavily loaded.
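
If requests really are spread evenly, one way to narrow things down might be to compare how many shard-level queries each node executes and how long each one takes. The sketch below assumes the input is the parsed JSON body of `GET /_nodes/stats/indices/search`; the function name `per_node_query_cost` is hypothetical. It does not identify the coordinating node, it only looks at the query phase executed on each node.

```python
def per_node_query_cost(nodes_stats):
    """Return {node_name: (query_total, avg_ms_per_query)}.

    `nodes_stats` is the parsed body of a nodes stats response that
    includes the indices.search section.
    """
    result = {}
    for node in nodes_stats["nodes"].values():
        search = node["indices"]["search"]
        total = search["query_total"]
        millis = search["query_time_in_millis"]
        # Avoid dividing by zero on nodes that have served no queries yet.
        average = millis / total if total else 0.0
        result[node["name"]] = (total, average)
    return result
```

These counters are cumulative since node start, so taking two samples a few minutes apart and comparing the differences gives a fairer picture. Similar query counts with very different average times would point at per-shard query cost rather than at request distribution.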


#5

Thanks for the help. Luckily, it solved itself...
It looks like Elasticsearch did a "big" merge of segments (about 20% less storage used afterwards). Since then we see even load on all nodes. Strange that it waited so long; this had been a problem for more than 2 weeks.
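
For anyone hitting the same thing, the segment layout can be compared across shard copies before a merge happens. This is a rough sketch, assuming the rows come from `GET /_cat/segments?format=json&bytes=b`; the function name `segment_summary` is invented for this example.

```python
from collections import defaultdict


def segment_summary(rows):
    """Summarize _cat/segments rows per shard copy.

    Returns {(shard, prirep, ip): {"segments": n, "bytes": b, "deleted_docs": d}}.
    """
    summary = defaultdict(lambda: {"segments": 0, "bytes": 0, "deleted_docs": 0})
    for row in rows:
        key = (int(row["shard"]), row["prirep"], row["ip"])
        entry = summary[key]
        entry["segments"] += 1  # one row per segment
        entry["bytes"] += int(row["size"])
        entry["deleted_docs"] += int(row["docs.deleted"])
    return dict(summary)
```

A shard copy with noticeably more segments or deleted documents than its peers would be consistent with what was seen here, where storage dropped by about 20% once the merge ran.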

