Do unbalanced primary shards affect indexing performance?

In a previous discussion there's a claim that we shouldn't worry about balancing primary shards across all nodes because primaries and replicas do almost the same amount of work during indexing.

Currently my primaries are highly unbalanced across the data nodes.
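
For reference, this is roughly how the per-node primary counts can be pulled (a minimal sketch with the elasticsearch-py client; the host is the same client node used in the commands below):

from collections import Counter

from elasticsearch import Elasticsearch

es = Elasticsearch(["es5-client01.c.fp:9200"])

# _cat/shards lists every shard copy with its type (p = primary, r = replica)
# and the node it is allocated to.
shards = es.cat.shards(h="prirep,node", format="json")

primaries = Counter(s["node"] for s in shards if s["prirep"] == "p")
for node, count in sorted(primaries.items()):
    print(node, count)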

I'm having a problem with my bulk thread pool queue filling up and tasks being rejected, and the nodes with the most primaries have the most rejections (in the _cat/thread_pool output below, the columns are active, queue, and rejected):

http es5-client01.c.fp:9200/_cat/thread_pool | grep bulk | grep data | sort
es5-data01   bulk                8 172 23634
es5-data02   bulk                1   0 13082
es5-data03   bulk                1   0     0
es5-data04   bulk                1   0 10812
es5-data05   bulk                0   0  1112
es5-data06   bulk                0   0  2071
es5-data07   bulk                0   0     0

The fact that some nodes have a non-empty queue while other nodes show no active jobs leads me to think that such an unbalanced cluster must affect indexing performance. Jobs are waiting in the queue on one node when they could be processed on another if the primaries were more balanced. Am I wrong here?
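
For what it's worth, the queue and rejected counters can also be watched over time with a small polling loop like this one (a sketch, assuming the same elasticsearch-py client and host as above; the number of polls and the interval are arbitrary):

import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["es5-client01.c.fp:9200"])

# Poll the bulk thread pool a few times so queue growth and rejections can be
# compared across nodes while the bulk job is running.
for _ in range(5):
    rows = es.cat.thread_pool(h="node_name,name,active,queue,rejected", format="json")
    for row in sorted(rows, key=lambda r: r["node_name"]):
        if row["name"] == "bulk":
            print(row["node_name"], row["active"], row["queue"], row["rejected"])
    print("---")
    time.sleep(30)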

Do you have a lot of update operations in these bulk requests?

Sorry for the delay; I didn't get a notification about your response.

On this current job, yes, I am updating documents. I use the elasticsearch-py bulk API with index commands; however, the _ids in the bulk payload already exist, so each operation results in an update.
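
Concretely, the bulk calls look roughly like this (a trimmed-down sketch; the index name, mapping type, and payload are illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["es5-client01.c.fp:9200"])

def actions(docs):
    # Plain index actions; because each _id already exists in the index,
    # the operation overwrites (updates) the existing document instead of
    # creating a new one.
    for doc_id, source in docs:
        yield {
            "_op_type": "index",   # index, not update
            "_index": "my-index",  # illustrative index name
            "_type": "doc",        # mapping type, still required on ES 5.x
            "_id": doc_id,
            "_source": source,
        }

docs = [("42", {"field": "value"})]  # illustrative payload
helpers.bulk(es, actions(docs), chunk_size=1000)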

If you index them with the index command, there should be no difference between primary and replica load. Could you run hot threads on the busy node and on a non-busy node and post both here?

Here are the hot threads for the busy node: https://pastebin.com/KQr0Ynrc/?e=1

And here for the non-busy node:

http https://es5-client01.c.fp:9200/_nodes/es5-data03/hot_threads
HTTP/1.1 200 OK
content-encoding: gzip
content-type: text/plain; charset=UTF-8
transfer-encoding: chunked

::: {es5-data03}{yspPeyimQAOzkKy68w14Gg}{T093J3sxQAyqEQsw3d-ZfQ}{10.208.0.128}{10.208.0.128:9300}{ml.max_open_jobs=10, ml.enabled=true}
   Hot threads at 2017-11-21T17:42:29.669Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    6.2% (30.9ms out of 500ms) cpu usage by thread 'elasticsearch[es5-data03][refresh][T#4]'
     10/10 snapshots sharing following 27 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
       java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
       org.elasticsearch.index.IndexWarmer$FieldDataWarmer.lambda$warmReader$2(IndexWarmer.java:164)
       org.elasticsearch.index.IndexWarmer$FieldDataWarmer$$Lambda$2052/1804043570.awaitTermination(Unknown Source)
       org.elasticsearch.index.IndexWarmer.warm(IndexWarmer.java:84)
       org.elasticsearch.index.IndexService.lambda$createShard$1(IndexService.java:343)
       org.elasticsearch.index.IndexService$$Lambda$2014/193368789.warm(Unknown Source)
       org.elasticsearch.index.engine.InternalEngine$SearchFactory.newSearcher(InternalEngine.java:1434)
       org.apache.lucene.search.SearcherManager.getSearcher(SearcherManager.java:198)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:160)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:92)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)

That's strange. Are you using custom routing, search preferences, or scan/scroll searches by any chance? Which version of Elasticsearch is this? Is the elasticsearch.yml file on the overloaded node different from the ones on the other nodes?

This is ES v5.5.2

There is an index on this cluster with parent/child documents, but it was only receiving incremental indexing (~10 docs/minute). The much heavier indexing job that was running at the time didn't use any custom routing.

elasticsearch.yml is the same across data nodes.

Also, I didn't know it at the time, but we've since discovered that the cloud provider our cluster is deployed to was having network problems when I grabbed these hot_threads. Once the network issues were resolved, the queues drained to zero rather quickly.

Could you rerun hot_threads, and this time run it a few times, say 5 runs with a 40-second interval between them, just to make sure it wasn't a fluke?

Sure: https://pastebin.com/nJ8j5C5Y

Sorry, I meant running hot_threads on the busy node and on some other node at the same time, so we can see how they compare.
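
Something along these lines would capture both nodes in the same loop (a sketch with the Python client; the node names are the busy and non-busy data nodes from the output above):

import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["es5-client01.c.fp:9200"])

# Five runs, 40 seconds apart, pulling hot_threads from the busy node
# (es5-data01) and the quiet node (es5-data03) in the same request.
for i in range(5):
    dump = es.nodes.hot_threads(node_id="es5-data01,es5-data03")
    with open("hot_threads_run_%d.txt" % i, "w") as f:
        f.write(dump)
    time.sleep(40)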
