Elasticsearch high cpu load on snapshot creation

farmean · February 26, 2021, 6:38pm

Hey everyone!
We've got an issue that's had us scratching our heads for a while.
We have a pretty straightforward setup: 13 nodes, 1 index, 1000 shards with 2 replicas each, a lot of nested documents, a high write load and a light read load.
We take snapshots every hour, and for some reason CPU load rockets to 80%+ during snapshot creation and a lot of segments get merged (or so it seems according to our metrics). I also noticed some not-so-typical translog behaviour during this time.
Any ideas are welcome. We can also provide more data if needed.

warkolm · February 27, 2021, 12:15am

Welcome to our community!

What version are you on?
How large is the index?
What monitoring are you using?

farmean · February 27, 2021, 9:04am

Hey Mark, thx for welcoming!

We're on version 7.8.
The index is pretty large: around 2.3 TB including replicas, so around 800 GB of primary data spread across 1k shards. Each shard is not that big (around 0.8 GB).
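As a rough check of these figures, here is a minimal Python sketch. The variable names are illustrative, and the sizes are the approximate ones quoted in this thread, not measured values.

```python
# Approximate figures quoted in the thread; all names here are illustrative.
nodes = 13
primary_gb = 800          # primary data, in GB
primary_shards = 1000
replicas = 2              # replica copies per primary shard

gb_per_shard = primary_gb / primary_shards
total_gb = primary_gb * (1 + replicas)
total_shard_copies = primary_shards * (1 + replicas)
copies_per_node = total_shard_copies / nodes

print(f"{gb_per_shard:.1f} GB per primary shard")
print(f"{total_gb} GB in total (~{total_gb / 1024:.1f} TiB)")
print(f"{total_shard_copies} shard copies, ~{copies_per_node:.0f} per node")
```

The point of the arithmetic is that every snapshot has to touch 1000 primary shards even though each one holds well under 1 GB.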
We monitor the cluster with justwatchcom/elasticsearch_exporter (GitHub) and Grafana dashboards.

I was looking through the sources last night and found that when a snapshot of a shard starts, the shard is flushed first (SnapshotShardsService::snapshot). What puzzles me is that shards should be flushed regularly anyway, so one more flush should not make much difference, but it does, and it leads to a lot of segments getting merged.
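One way to test the flush theory would be to capture index stats just before and just after a snapshot and compare the flush, merge and translog counters. The sketch below assumes the stats come from something like `GET /<index>/_stats/flush,merge,translog` and that the response uses the field names listed in `COUNTERS`; that layout is an assumption to verify against your version. The `before` and `after` values are made up purely to show the shape of the data.

```python
from typing import Any, Dict

# Counters to compare, as (section, field) pairs under "_all" -> "total".
# These field names are an assumption about the _stats response layout.
COUNTERS = [
    ("flush", "total"),
    ("merges", "total"),
    ("merges", "total_size_in_bytes"),
    ("translog", "uncommitted_operations"),
]


def counter_deltas(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, int]:
    """Return (after - before) for each counter listed in COUNTERS."""
    deltas = {}
    for section, field in COUNTERS:
        old = before["_all"]["total"][section][field]
        new = after["_all"]["total"][section][field]
        deltas[f"{section}.{field}"] = new - old
    return deltas


# Hypothetical stats snapshots (made-up numbers, not from a real cluster).
before = {"_all": {"total": {
    "flush": {"total": 5000},
    "merges": {"total": 12000, "total_size_in_bytes": 9_000_000_000},
    "translog": {"uncommitted_operations": 40000},
}}}
after = {"_all": {"total": {
    "flush": {"total": 6000},
    "merges": {"total": 12350, "total_size_in_bytes": 9_700_000_000},
    "translog": {"uncommitted_operations": 0},
}}}

deltas = counter_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
```

If the flush count jumps by roughly the number of primary shards and merges rise right after, that would be consistent with the snapshot-triggered flush producing many small segments that then get merged.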

warkolm · February 27, 2021, 9:06am

That's a real waste of resources and likely the cause of the CPU spike, as it has to prep those 1K shards.

farmean · February 27, 2021, 9:21am

Thanks for the quick reply!
What would you suggest here? We had about 20 shards last year, and they were quickly approaching the 30-50 GB per shard mark, at which point we had to reindex everything. It turns out that having too many shards causes issues too. Are we bound to reindex our data from time to time as the index gets larger and larger? I think we could use the shrink API for now to scale it down, but the time for a reindex will come sooner or later.
I have a feeling that this whole thing is becoming pretty unmanageable and that we'd better split our data into multiple indices and add new
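On the shrink idea: the shrink API only accepts a target primary shard count that is a factor of the source count, so not every number is possible from 1000 shards. A minimal sketch for listing the candidates follows. The function name is hypothetical, the 800 GB figure is the approximate primary size quoted earlier in the thread, and the 10-50 GB per-shard window is an assumed target range, not an official limit.

```python
from typing import List, Tuple


def shrink_targets(source_shards: int, primary_gb: float,
                   min_gb: float, max_gb: float) -> List[Tuple[int, float]]:
    """List (shard_count, gb_per_shard) pairs that are valid shrink targets.

    A target is valid here if it divides source_shards evenly and the
    resulting per-shard size falls within [min_gb, max_gb].
    """
    targets = []
    for count in range(1, source_shards + 1):
        if source_shards % count != 0:
            continue  # shrink requires a factor of the source shard count
        size = primary_gb / count
        if min_gb <= size <= max_gb:
            targets.append((count, size))
    return targets


# Figures from the thread: 1000 primary shards, ~800 GB of primary data.
candidates = shrink_targets(1000, 800, min_gb=10, max_gb=50)
for count, size in candidates:
    print(f"{count} shards -> {size:.0f} GB per shard")
```

Since the index keeps growing, a smaller per-shard size at shrink time leaves more headroom before the next reindex or split into multiple indices.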

warkolm · February 27, 2021, 9:22am

What sort of data is it?

farmean · March 24, 2021, 8:24am

Hey, sorry for not replying for so long!
Which characteristics of the data do you mean?
There are a number of root-level documents, and each of them has a lot of nested ones. The text is pretty standard.

system · April 21, 2021, 8:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
