Moving shards is slow

This question is for ES version 6.1.2

I have a 30 GB index split into 60 shards (about 500 MB each). I turned off allocation:

"cluster.routing.allocation.enable": "none"

I sent move commands to move 6 shards. It is coming up to an hour and they are still "moving".
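
For reference, the move commands look roughly like this (the index, shard, and node values below are only an example):

	POST _cluster/reroute
	{
		"commands": [
			{
				"move": {
					"index": "unittest20180325_000000",
					"shard": 12,
					"from_node": "spot_172.31.1.41",
					"to_node": "spot_172.31.1.133"
				}
			}
		]
	}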

Here is my config:

node.name: 1
node.master: false
node.data: true
node.ingest: true
node.attr.zone: primary

cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.enable: none
cluster.routing.allocation.cluster_concurrent_rebalance: 1
cluster.routing.allocation.node_concurrent_recoveries: 1

thread_pool.bulk.queue_size: 3000
thread_pool.index.queue_size: 1000

indices.recovery.max_bytes_per_sec: 1000mb
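
(The recovery-related values can also be changed at runtime instead of in elasticsearch.yml; a sketch of the dynamic equivalent:)

	PUT _cluster/settings
	{
		"transient": {
			"indices.recovery.max_bytes_per_sec": "1000mb",
			"cluster.routing.allocation.node_concurrent_recoveries": 1
		}
	}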

Anyone have suggestions?

Thank you


Here is a chart of network bytes sent for the machine I am moving shards off of.

[chart: network bytes sent by the source node]

Could the "move" actually be copying from the node with the primary shard, rather than from the node that the shard is on? One of my nodes has all the primary shards, and it appears busy.

Recovering and moving shards is now slow by design. https://github.com/elastic/elasticsearch/issues/29436#issuecomment-379934742

Recovering and moving shards is now slow by design.

@klahnakoski that tone of voice doesn't buy you much love.

I sent move commands to move 6 shards. It is coming up to an hour and they are still "moving".

Can you post the output of the recovery api: GET _recovery?active_only=true&human ?
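
For example, with curl against any node (assuming the default HTTP port):

	curl -s 'http://localhost:9200/_recovery?active_only=true&human&pretty'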

My apologies for that tone, my emotions got the best of me.

Since this discussion started, I bounced the node that had all the primary shards so that the primaries would distribute a bit. The primaries are still "clumping", but recovery is faster. This is from our new cluster; it holds about a 1/4 sample of the data from our existing cluster.

Given things have changed, I do not know if _recovery will provide you with any information, but here it is:

{
	"unittest20180325_000000":{"shards":[{
		"id":12,
		"type":"PEER",
		"stage":"INDEX",
		"primary":false,
		"start_time":"2018-04-10T10:47:01.290Z",
		"start_time_in_millis":1523357221290,
		"total_time":"1.5m",
		"total_time_in_millis":92397,
		"source":{
			"id":"OpYrgkLLR6W_GZsML08Kzw",
			"host":"172.31.1.41",
			"transport_address":"172.31.1.41:9300",
			"ip":"172.31.1.41",
			"name":"spot_172.31.1.41"
		},
		"target":{
			"id":"fB2lIO3AShmS_Mu7_db6kw",
			"host":"172.31.1.133",
			"transport_address":"172.31.1.133:9300",
			"ip":"172.31.1.133",
			"name":"spot_172.31.1.133"
		},
		"index":{
			"size":{
				"total":"5.1gb",
				"total_in_bytes":5553993709,
				"reused":"0b",
				"reused_in_bytes":0,
				"recovered":"1.5gb",
				"recovered_in_bytes":1651615562,
				"percent":"29.7%"
			},
			"files":{"total":117,"reused":0,"recovered":110,"percent":"94.0%"},
			"total_time":"1.5m",
			"total_time_in_millis":92392,
			"source_throttle_time":"0s",
			"source_throttle_time_in_millis":0,
			"target_throttle_time":"0s",
			"target_throttle_time_in_millis":0
		},
		"translog":{
			"recovered":0,
			"total":0,
			"percent":"100.0%",
			"total_on_start":0,
			"total_time":"0s",
			"total_time_in_millis":0
		},
		"verify_index":{
			"check_index_time":"0s",
			"check_index_time_in_millis":0,
			"total_time":"0s",
			"total_time_in_millis":0
		}
	}]},
	"unittest20180401_000000":{"shards":[
		{
			"id":29,
			"type":"PEER",
			"stage":"INDEX",
			"primary":true,
			"start_time":"2018-04-10T10:44:16.083Z",
			"start_time_in_millis":1523357056083,
			"total_time":"4.2m",
			"total_time_in_millis":257607,
			"source":{
				"id":"tGN6xh4EQe-cMdEm0G3ZCg",
				"host":"172.31.1.161",
				"transport_address":"172.31.1.161:9300",
				"ip":"172.31.1.161",
				"name":"spot_172.31.1.161"
			},
			"target":{
				"id":"leqOW1llRq-7s_amdj9rhA",
				"host":"172.31.1.235",
				"transport_address":"172.31.1.235:9300",
				"ip":"172.31.1.235",
				"name":"spot_172.31.1.161"
			},
			"index":{
				"size":{
					"total":"5.1gb",
					"total_in_bytes":5489771388,
					"reused":"0b",
					"reused_in_bytes":0,
					"recovered":"4.9gb",
					"recovered_in_bytes":5278959664,
					"percent":"96.2%"
				},
				"files":{"total":129,"reused":0,"recovered":128,"percent":"99.2%"},
				"total_time":"4.2m",
				"total_time_in_millis":256860,
				"source_throttle_time":"0s",
				"source_throttle_time_in_millis":0,
				"target_throttle_time":"0s",
				"target_throttle_time_in_millis":0
			},
			"translog":{
				"recovered":0,
				"total":6618,
				"percent":"0.0%",
				"total_on_start":6565,
				"total_time":"0s",
				"total_time_in_millis":0
			},
			"verify_index":{
				"check_index_time":"0s",
				"check_index_time_in_millis":0,
				"total_time":"0s",
				"total_time_in_millis":0
			}
		},
		{
			"id":4,
			"type":"PEER",
			"stage":"INDEX",
			"primary":true,
			"start_time":"2018-04-10T10:35:58.794Z",
			"start_time_in_millis":1523356558794,
			"total_time":"12.5m",
			"total_time_in_millis":754889,
			"source":{
				"id":"rNtDq87WQhWaHsKlbZRPzw",
				"host":"172.31.1.76",
				"transport_address":"172.31.1.76:9300",
				"ip":"172.31.1.76",
				"name":"spot_172.31.1.76"
			},
			"target":{
				"id":"eLdwRD2RR7SmkJWkelrdhw",
				"host":"172.31.1.163",
				"transport_address":"172.31.1.163:9300",
				"ip":"172.31.1.163",
				"name":"spot_172.31.1.76"
			},
			"index":{
				"size":{
					"total":"5gb",
					"total_in_bytes":5469790162,
					"reused":"0b",
					"reused_in_bytes":0,
					"recovered":"5gb",
					"recovered_in_bytes":5391906410,
					"percent":"98.6%"
				},
				"files":{"total":124,"reused":0,"recovered":123,"percent":"99.2%"},
				"total_time":"12.5m",
				"total_time_in_millis":753338,
				"source_throttle_time":"213.5ms",
				"source_throttle_time_in_millis":213,
				"target_throttle_time":"0s",
				"target_throttle_time_in_millis":0
			},
			"translog":{
				"recovered":0,
				"total":17026,
				"percent":"0.0%",
				"total_on_start":16983,
				"total_time":"0s",
				"total_time_in_millis":0
			},
			"verify_index":{
				"check_index_time":"0s",
				"check_index_time_in_millis":0,
				"total_time":"0s",
				"total_time_in_millis":0
			}
		}
	]}
}

I think the only solution is to manually balance the primary shards across the cluster so that recovery load is distributed when it is needed. This may take the form of a daemon that swaps the primary of one shard for a replica of another to maintain overall balance. Also, being deliberate about node startup order may coerce shards to be marked primary "correctly".
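
One way such a daemon might move a primary off a node is a reroute "cancel" against the current primary copy, which forces an in-sync replica to be promoted (a sketch only; the index, shard, and node values are placeholders):

	POST _cluster/reroute
	{
		"commands": [
			{
				"cancel": {
					"index": "unittest20180401_000000",
					"shard": 4,
					"node": "spot_172.31.1.76",
					"allow_primary": true
				}
			}
		]
	}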

Let's get one thing out of the way :smiley:

Elasticsearch doesn't try to balance primaries. The reason is that moving shards around is a heavy process, and primaries typically do exactly the same work as replicas. (Updates are currently an exception, but that should be fixed.) Replica relocation and recovery are rare events and amount to the same work that moving the primary (to balance) would do. Promoting a replica to a primary while demoting the current primary to a replica is another bag of trouble. Long story short, the price of balancing primaries is just too high compared to the benefits. Note that new primaries are balanced, but events like node restarts change this.

I wanted to see details about ongoing recoveries (like throttling, etc.), so the output is not that helpful. That said, your shards are very small (5 GB), so they should move relatively fast. Also, I wonder why you have so many shards.

I did not mean to imply Elasticsearch should balance primaries; I meant that I should be doing it manually (or with an external daemon).

My particular configuration has two zones: one is a "backup" zone of weak machines with large drives, and the other is a "spot" zone of powerful machines with faster drives. The spot zone is a bit unstable, but it provides us with 10x the processing power for a given price point. Stability in the spot zone is enhanced by distributing over many machine types at a variety of price points, and nodes get plenty of time to shut down when they are about to be reclaimed. Elasticsearch 1.7 is awesome in this configuration! It handles shard replication and distribution on top of an ever-changing node environment.

Our use case is a few users querying billions of records covering our test results, CI metadata, and data extracted from a few other reasonably large systems. Data consistency is nice-to-have, but at our scale data is lost all the time. The CI writes partial files, CI has bugs, tests are failing, S3 times out, schemas are changing, the ETL pipeline does not handle corner cases, etc. Losses that Elasticsearch introduces are relatively tiny. The whole system is statistically accurate for our needs. Even if we lose live data, we have the original datasources to fall back on. Elasticsearch 1.7 is awesome for this use case! We need not manage the changing schemas, and queries are significantly faster than any database or big-data solution on the given hardware (not including the 10x savings from spot nodes).

So, we have had wonderful success with Elasticsearch 1.7 for over two years! Now it is time to get even more benefit from an upgrade! Even if (manually) moving primaries is expensive, the orders-of-magnitude benefit we get from Elasticsearch will be worth it, as long as it can recover shards in a timely manner.

The 5 GB shards you see are on the new ES 6.1 cluster; they are just small samples of production data while we run scaling tests. In production we create a 60-shard index each week to store 2 TB of test data. We target 20 GB shards (+/- 10 GB) so we can move shards quickly for the inevitable recovery. We also limit our concurrent recoveries so that shards move as fast as possible and can accept query load ASAP; it is better to move 1 shard in 1 minute than 2 shards in 2 minutes, because the former gives us a shard to query sooner.
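
(For checking shard sizes and where the primaries sit, something like this _cat query does the job; the unittest* pattern is just our index naming:)

	GET _cat/shards/unittest*?v&h=index,shard,prirep,state,store,node&s=store:desc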

Happy to hear that. My point is that I can't think of anything that would explain this (1.7 and 0.20 also copied from the primary when moving a replica), so I want to dig deeper to potentially find the root cause.

The cluster is currently moving shards back to a node in the "backup" zone [below]. The "network in" chart hints that there is incoming data, but the amount of network traffic is much higher than the size of the shards would suggest. Indexing is still on, so maybe that is why it is slow.

[chart: network in on the backup node]

Just to be sure I am reading this correctly:

	"translog":{
		"recovered":684716,
		"total":684716,
		"percent":"100.0%",
		"total_on_start":684384,
		"total_time_in_millis":852401
	}

means it took 852 seconds (about 14 minutes) to transfer 684K translog operations? That seems about right. :frowning:
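
Rough math, taking those numbers at face value:

	852401 ms ≈ 852 s ≈ 14.2 minutes
	684716 translog ops / 852 s ≈ 800 ops per second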

I will stop sending bulk requests and provide another sample shortly.

{"unittest20180408_000000":{"shards":[
	{
		"id":27,
		"type":"PEER",
		"stage":"TRANSLOG",
		"primary":false,
		"start_time_in_millis":1523371000111,
		"total_time_in_millis":230578,
		"source":{
			"id":"3G_WPjbmRBqpAOstJWsvgw",
			"host":"172.31.1.129",
			"transport_address":"172.31.1.129:9300",
			"ip":"172.31.1.129",
			"name":"spot_172.31.1.129"
		},
		"target":{
			"id":"bpv3UtwOS5aQGQhzJsd_Iw",
			"host":"172.31.1.240",
			"transport_address":"172.31.1.240:9300",
			"ip":"172.31.1.240",
			"name":"1"
		},
		"index":{
			"size":{
				"total_in_bytes":1676683149,
				"reused_in_bytes":0,
				"recovered_in_bytes":1676683149,
				"percent":"100.0%"
			},
			"files":{"total":94,"reused":0,"recovered":94,"percent":"100.0%"},
			"total_time_in_millis":21698,
			"source_throttle_time_in_millis":0,
			"target_throttle_time_in_millis":0
		},
		"translog":{
			"recovered":519196,
			"total":696154,
			"percent":"74.6%",
			"total_on_start":696139,
			"total_time_in_millis":208874
		},
		"verify_index":{"check_index_time_in_millis":0,"total_time_in_millis":0}
	},
	{
		"id":11,
		"type":"PEER",
		"stage":"FINALIZE",
		"primary":true,
		"start_time_in_millis":1523369104601,
		"total_time_in_millis":2126088,
		"source":{
			"id":"3G_WPjbmRBqpAOstJWsvgw",
			"host":"172.31.1.129",
			"transport_address":"172.31.1.129:9300",
			"ip":"172.31.1.129",
			"name":"spot_172.31.1.240"
		},
		"target":{
			"id":"bpv3UtwOS5aQGQhzJsd_Iw",
			"host":"172.31.1.240",
			"transport_address":"172.31.1.240:9300",
			"ip":"172.31.1.240",
			"name":"1"
		},
		"index":{
			"size":{
				"total_in_bytes":1668127602,
				"reused_in_bytes":0,
				"recovered_in_bytes":1668127602,
				"percent":"100.0%"
			},
			"files":{"total":95,"reused":0,"recovered":95,"percent":"100.0%"},
			"total_time_in_millis":49145,
			"source_throttle_time_in_millis":0,
			"target_throttle_time_in_millis":0
		},
		"translog":{
			"recovered":684716,
			"total":684716,
			"percent":"100.0%",
			"total_on_start":684384,
			"total_time_in_millis":852401
		},
		"verify_index":{"check_index_time_in_millis":0,"total_time_in_millis":0}
	}
]}}

I turned off the bulk indexing, and there are no clients running queries; only the shard moves should be happening now. The inbound network usage is high:

[chart: inbound network on the backup node]

CPU went up, which is odd; I would have expected it to go down now that no indexing is being done.

[chart: CPU usage on the backup node]

Yup, it is ES

[chart: per-process CPU, attributed to Elasticsearch]

The shard movement does not seem to be going any faster:

{"unittest20180408_000000":{"shards":[
	{
		"id":16,
		"type":"PEER",
		"stage":"TRANSLOG",
		"primary":false,
		"start_time_in_millis":1523374645798,
		"total_time_in_millis":128170,
		"source":{
			"id":"bpv3UtwOS5aQGQhzJsd_Iw",
			"host":"172.31.1.240",
			"transport_address":"172.31.1.240:9300",
			"ip":"172.31.1.240",
			"name":"1"
		},
		"target":{
			"id":"rNtDq87WQhWaHsKlbZRPzw",
			"host":"172.31.1.76",
			"transport_address":"172.31.1.76:9300",
			"ip":"172.31.1.76",
			"name":"spot_172.31.1.76"
		},
		"index":{
			"size":{
				"total_in_bytes":1779196762,
				"reused_in_bytes":0,
				"recovered_in_bytes":1779196762,
				"percent":"100.0%"
			},
			"files":{"total":82,"reused":0,"recovered":82,"percent":"100.0%"},
			"total_time_in_millis":57971,
			"source_throttle_time_in_millis":19,
			"target_throttle_time_in_millis":0
		},
		"translog":{
			"recovered":70592,
			"total":91654,
			"percent":"77.0%",
			"total_on_start":91654,
			"total_time_in_millis":70155
		},
		"verify_index":{"check_index_time_in_millis":0,"total_time_in_millis":0}
	},
	{
		"id":51,
		"type":"PEER",
		"stage":"FINALIZE",
		"primary":true,
		"start_time_in_millis":1523373837498,
		"total_time_in_millis":936503,
		"source":{
			"id":"tGN6xh4EQe-cMdEm0G3ZCg",
			"host":"172.31.1.161",
			"transport_address":"172.31.1.161:9300",
			"ip":"172.31.1.161",
			"name":"spot_172.31.1.161"
		},
		"target":{
			"id":"bpv3UtwOS5aQGQhzJsd_Iw",
			"host":"172.31.1.240",
			"transport_address":"172.31.1.240:9300",
			"ip":"172.31.1.240",
			"name":"1"
		},
		"index":{
			"size":{
				"total_in_bytes":1767363486,
				"reused_in_bytes":0,
				"recovered_in_bytes":1767363486,
				"percent":"100.0%"
			},
			"files":{"total":88,"reused":0,"recovered":88,"percent":"100.0%"},
			"total_time_in_millis":25267,
			"source_throttle_time_in_millis":0,
			"target_throttle_time_in_millis":0
		},
		"translog":{
			"recovered":709978,
			"total":709978,
			"percent":"100.0%",
			"total_on_start":709978,
			"total_time_in_millis":450217
		},
		"verify_index":{"check_index_time_in_millis":0,"total_time_in_millis":0}
	},
	{
		"id":54,
		"type":"PEER",
		"stage":"TRANSLOG",
		"primary":false,
		"start_time_in_millis":1523374639051,
		"total_time_in_millis":134950,
		"source":{
			"id":"leqOW1llRq-7s_amdj9rhA",
			"host":"172.31.1.235",
			"transport_address":"172.31.1.235:9300",
			"ip":"172.31.1.235",
			"name":"spot_172.31.1.235"
		},
		"target":{
			"id":"bpv3UtwOS5aQGQhzJsd_Iw",
			"host":"172.31.1.240",
			"transport_address":"172.31.1.240:9300",
			"ip":"172.31.1.240",
			"name":"1"
		},
		"index":{
			"size":{
				"total_in_bytes":1812624877,
				"reused_in_bytes":0,
				"recovered_in_bytes":1812624877,
				"percent":"100.0%"
			},
			"files":{"total":117,"reused":0,"recovered":117,"percent":"100.0%"},
			"total_time_in_millis":24835,
			"source_throttle_time_in_millis":0,
			"target_throttle_time_in_millis":22
		},
		"translog":{
			"recovered":187122,
			"total":720866,
			"percent":"26.0%",
			"total_on_start":720866,
			"total_time_in_millis":110098
		},
		"verify_index":{"check_index_time_in_millis":0,"total_time_in_millis":0}
	}
]}}

Network inbound continued to stay high on the backup node, with little shard movement. I restarted the backup node, and activity ceased. My next step is to resume moving shards back to this backup node.

I shut down the backup node, brought it back up, shut it down again, upgraded to 6.2.3, and started the node.

Still slow

{"unittest20180408_000000":{"shards":[{
	"id":10,
	"type":"PEER",
	"stage":"FINALIZE",
	"primary":true,
	"start_time_in_millis":1523381732042,
	"total_time_in_millis":498246,
	"source":{
		"id":"T2ydujSnQPC-8oew1a2XLg",
		"host":"172.31.1.170",
		"transport_address":"172.31.1.170:9300",
		"ip":"172.31.1.170",
		"name":"spot_172.31.1.170"
	},
	"target":{
		"id":"bpv3UtwOS5aQGQhzJsd_Iw",
		"host":"172.31.1.240",
		"transport_address":"172.31.1.240:9300",
		"ip":"172.31.1.240",
		"name":"1"
	},
	"index":{
		"size":{
			"total_in_bytes":1814208447,
			"reused_in_bytes":0,
			"recovered_in_bytes":1814208447,
			"percent":"100.0%"
		},
		"files":{"total":119,"reused":0,"recovered":119,"percent":"100.0%"},
		"total_time_in_millis":37732,
		"source_throttle_time_in_millis":0,
		"target_throttle_time_in_millis":0
	},
	"translog":{
		"recovered":722193,
		"total":722193,
		"percent":"100.0%",
		"total_on_start":722193,
		"total_time_in_millis":281660
	},
	"verify_index":{"check_index_time_in_millis":0,"total_time_in_millis":0}
}]}}

At this point, I hope this slowdown is due to a setting and not due to the imbalance in the primaries. Even moving single shards is slow.
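
One way to rule out a stray override is to list the settings that are explicitly set on the cluster (this shows only persistent/transient overrides, not defaults):

	GET _cluster/settings?flat_settings=true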
