Issue with a single node in cluster seemingly doing all the bulk indexing!


(Asharif) #1

So we have a 16-node cluster, and each index is spread across 16 shards. Up until last week, bulk indexing seemed to be distributed correctly across all the nodes/shards. For the last couple of days, however, a single node at a time seems to end up with most of the active bulk indexing threads, while the other nodes have just a few.

running this:

curl -s "localhost:9200/_cat/thread_pool?v&h=host,bulk.max,bulk.size,bulk.active,bulk.queueSize,bulk.queue,bulk.rejected"

with a little bit of bash-fu yields:

host                  bulk.max bulk.size bulk.active bulk.queueSize bulk.queue bulk.rejected
es1-prod.aerserv.com       100       100           5            500          0             0
es2-prod.aerserv.com       100       100           7            500          0             0
es3-prod.aerserv.com       100       100           6            500          0             0
es4-prod.aerserv.com       100       100          18            500          0             0
es5-prod.aerserv.com       100       100           4            500          0             0
es6-prod.aerserv.com       100       100         100            500         43             0
es7-prod.aerserv.com       100       100           6            500          0             0
es8-prod.aerserv.com       100       100           7            500          0             0
es9-prod.aerserv.com       100       100           6            500          0             0
es10-prod.aerserv.com      100       100           5            500          0             0
es11-prod.aerserv.com      100       100           7            500          0             0
es12-prod.aerserv.com      100       100           6            500          0             0
es13-prod.aerserv.com      100       100           5            500          0             0
es14-prod.aerserv.com      100       100           6            500          0             0
es15-prod.aerserv.com      100       100          21            500          0             0
es16-prod.aerserv.com      100       100           7            500          0             0

Note es6-prod: it's stuck at 100 bulk.active with 43 requests in bulk.queue. All the other nodes go up and down, but none get that high.
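
For anyone who wants to script this check instead of eyeballing it, here is a minimal Python sketch. It assumes the `_cat/thread_pool` output has already been captured as text with the column names shown above; the function name `saturated_nodes` and the shortened three-row sample are illustrative only.

```python
def saturated_nodes(cat_output):
    """Return hosts whose bulk pool is fully busy or has queued requests."""
    lines = [ln for ln in cat_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    hot = []
    for line in lines[1:]:
        # Map each whitespace-separated value to its column name.
        row = dict(zip(header, line.split()))
        active = int(row["bulk.active"])
        size = int(row["bulk.size"])
        queued = int(row["bulk.queue"])
        if active >= size or queued > 0:
            hot.append(row["host"])
    return hot


# Three rows copied from the output above (shortened for the example).
SAMPLE = """\
host bulk.max bulk.size bulk.active bulk.queueSize bulk.queue bulk.rejected
es5-prod.aerserv.com 100 100 4 500 0 0
es6-prod.aerserv.com 100 100 100 500 43 0
es7-prod.aerserv.com 100 100 6 500 0 0
"""

print(saturated_nodes(SAMPLE))  # ['es6-prod.aerserv.com']
```

Splitting on whitespace means the parser works on both the raw and the column-aligned form of the output.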

Any ideas would be appreciated!


(Mark Walkom) #2

How are you sending the bulks to the cluster?
Are you sure the shards are balanced across your nodes?


#3

Hi, I'm working on the same cluster. It's well-behaved under nominal conditions, but when the size of our main index approaches 3+ billion (we use daily indices, so this happens toward the end of the day), things start to get out of balance. The bulk thread pool goes crazy on one node, and that node ends up at 99% CPU utilization and a load of 45+. When the day "ticks over" and we start writing to a new, empty index, everything goes back to normal.

How are you sending the bulks to the cluster?

Using the Bulk API via the Java TransportClient, in batches of 5000 events.
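
To make the batching concrete, here is a rough Python sketch of how events could be grouped into batches of 5000 and serialized in the newline-delimited format the REST Bulk API accepts. This is not the TransportClient code running against the cluster; it only illustrates the same batching idea. The function name `bulk_bodies`, the index name `events-daily`, and the type name `event` are hypothetical, and `_type` is included on the assumption that the cluster runs a release that still requires it.

```python
import json


def bulk_bodies(events, index, doc_type, batch_size=5000):
    """Yield one newline-delimited Bulk API body per batch of events."""
    for start in range(0, len(events), batch_size):
        lines = []
        for event in events[start:start + batch_size]:
            # Action line, then the document source on the next line.
            lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
            lines.append(json.dumps(event))
        # The bulk format requires a trailing newline after the last line.
        yield "\n".join(lines) + "\n"


# Hypothetical data: 12000 small events split into batches of 5000.
events = [{"n": i} for i in range(12000)]
bodies = list(bulk_bodies(events, "events-daily", "event"))
print(len(bodies))  # 3
```

Each body would be sent as a single bulk request, so 12000 events become two full batches and one partial batch of 2000.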

Are you sure the shards are balanced across your nodes?

Yes. The cluster is 16 nodes in size, each index is split into 16 shards, and they're evenly distributed.
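
One way to confirm the distribution per index rather than cluster-wide is to count shard copies per node from `_cat/shards` output. The Python sketch below assumes the output was requested with a header row and the columns `index`, `shard`, `prirep`, `state`, and `node`, and that node names contain no spaces. The function name `shards_per_node`, the index names, and the sample rows are made up for illustration and are not taken from the cluster.

```python
from collections import Counter


def shards_per_node(cat_shards_output, index):
    """Count STARTED shard copies of one index on each node."""
    lines = [ln for ln in cat_shards_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    counts = Counter()
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Skip other indices and shards that are not fully started.
        if row["index"] == index and row["state"] == "STARTED":
            counts[row["node"]] += 1
    return counts


# Hypothetical sample: one node holds two shards of the daily index.
SHARDS = """\
index shard prirep state node
events-daily 0 p STARTED es1-prod
events-daily 1 p STARTED es2-prod
events-daily 2 p STARTED es1-prod
events-daily 3 p STARTED es3-prod
other-index 0 p STARTED es2-prod
"""

print(dict(shards_per_node(SHARDS, "events-daily")))
```

With 16 shards on 16 nodes, an even layout for a given index would show every node with the same count; any node with a higher count for the current day's index would be worth a closer look.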

