We use ES for log management, and the cluster is built on a hot-cold architecture. Each physical host runs one hot node and one cold node. The hot node and cold node share CPU and memory but use different storage (SSD for the hot node, SATA for the cold node).
Today's logs are bulk-indexed to the hot node, with the index's replica count set to 0. Yesterday's logs are moved to the cold node, and after the move finishes the replica count is set to 1. But we found that while replication is running on the cold node, indexing speed is greatly reduced. The load on the physical host is not high, yet when we checked the thread pools we found that many bulk requests were being rejected.
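The daily move described above is typically done with index-level shard allocation filtering followed by a replica bump. A minimal sketch of the two calls, assuming the nodes are tagged with a node attribute such as node.attr.box_type: hot / cold and an index name like logs-yesterday (both names are illustrative assumptions, not taken from this thread):

```
# Step 1: require yesterday's index to live on the cold nodes,
# which triggers relocation of its shards.
PUT logs-yesterday/_settings
{
  "index.routing.allocation.require.box_type": "cold"
}

# Step 2: once relocation has finished, add the replica.
PUT logs-yesterday/_settings
{
  "index.number_of_replicas": 1
}
```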
When indexing speed was impacted, we set cluster.routing.allocation.node_concurrent_recoveries to 0 and the indexing speed would recover.
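For reference, that workaround can be applied at runtime through the cluster update settings API; a sketch (transient, so it does not survive a full cluster restart):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 0
  }
}
```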
Which version of Elasticsearch are you using? What is the specification of the hosts with respect to CPU, RAM and network? How large are your shards? How many nodes do you have in the cluster?
ES version: 5.4.3
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Disk: 4 x 1.6TB SSD, 12 x 8TB SATA
43 nodes in the cluster: 20 hot, 20 cold, 3 master
Daily log indexing volume: 30TB/day
The replication of yesterday's index shouldn't interact at all with today's indexing within Elasticsearch since they're on separate nodes. I therefore think that this interaction is via some other channel. The nodes share CPU and memory, and also both IO and network bandwidth, as well as other resources like filesystem cache. I think that at least one of these is suffering from contention.
I think I would start by looking harder for IO contention. Even though the nodes don't share storage they will be sharing other parts of their IO subsystems.
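One way to look for this, assuming Linux hosts with the sysstat package installed, is to watch per-device utilization on both the SSD and SATA devices while a recovery is running:

```
# Extended per-device stats every 5 seconds; sustained high %util
# or rising await on either device set during recovery points to
# contention in the shared IO path.
iostat -x 5
```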
You say you reduced cluster.routing.allocation.node_concurrent_recoveries to 0. What was it set to previously? Do you still see problems if it's set to 1? Also, you can control the rate of recovery with indices.recovery.max_bytes_per_sec. Have you set this so high that it could be slowing down indexing? If you set it lower, does it help?
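To answer the "what was it set to previously" question, the current values (including defaults) can be inspected with the cluster settings API; a sketch:

```
GET _cluster/settings?include_defaults=true&flat_settings=true
```

Look for cluster.routing.allocation.node_concurrent_recoveries and indices.recovery.max_bytes_per_sec in the output.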
Thanks for your reply. My ES settings:
So these settings are already set to low values.
I am wondering: the cold node exposes an HTTP port, so it also serves bulk requests. Will bulk requests routed through a cold node be affected while that cold node is doing replication?
I think you should avoid the cold nodes when performing your bulk indexing in order to be sure there's no interaction. However I don't immediately see how this could be causing the effect you describe. Are the bulk rejections happening on the cold nodes or on the hot nodes?
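One way to see where the rejections are happening is the cat thread pool API; in 5.x the bulk thread pool is named bulk (it was later renamed to write in 6.x):

```
GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected
```

The per-node rejected column shows whether the rejections come from the hot or the cold nodes.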
400MB/s seems quite high - that's over 3Gbps, which could well be enough to saturate your IO bandwidth. Try reducing this to see if that helps.
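A sketch of reducing it, e.g. back toward the 40mb default (the exact value to use is a judgment call, not a recommendation from this thread):

```
PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}
```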
I will try to reduce indices.recovery.max_bytes_per_sec to see if this helps.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.