Shards getting bigger with updates (same number of documents)

Hi all

I have an ES 7 cluster of a couple of EC2 instances. The index has 6 shards and each shard has 2 replicas (so (1+2) x 6 = 18 shards for the index). When I create the index, each shard is around 25-30gb and we hold around 3mln records in the database. We have a fair number of updates happening every day, let's say around 1mln; an update means the record gets replaced by a new one but the ID stays the same, so we keep pretty much the same number of documents. I've noticed that after a couple of weeks the shard size grows to 50gb, so nearly double. Could someone please explain why this is happening and how I can fix it? (Or should I fix it at all?) I've noticed search performance going down once we reach 50gb shards. Any comments/help would be highly appreciated.

Thanks

marcineq

Welcome to our community! :smiley:

What is the output from the _cat/indices?v API?

Thanks Mark

    index_name                    2     p      STARTED 46750788    50gb 11.11.11.111 ip-11.11.11.111-es
    index_name                    2     r      STARTED 46750788  49.3gb 22.22.22.222 ip-22.22.22.222-es
    index_name                    2     r      STARTED 46750788  44.4gb 33.33.33.333 ip-33.33.33.333-es
    index_name                    1     p      STARTED 46532522  47.9gb 44.44.44.444 ip-44.44.44.444-es
    index_name                    1     r      STARTED 46532522  52.7gb 55.55.55.555 ip-55.55.55.555-es
    index_name                    1     r      STARTED 46532522    49gb 66.66.66.666 ip-66.66.66.666-es
    index_name                    3     r      STARTED 46677577    52gb 11.11.11.111 ip-11.11.11.111-es
    index_name                    3     p      STARTED 46677577  47.5gb 55.55.55.555 ip-55.55.55.555-es
    index_name                    3     r      STARTED 46677577  44.4gb 77.77.77.777 ip-77.77.77.777-es
    index_name                    5     p      STARTED 46736104  50.8gb 88.88.88.888 ip-88.88.88.888-es
    index_name                    5     r      STARTED 46736104  52.8gb 99.99.99.999 ip-99.99.99.999-es
    index_name                    5     r      STARTED 46736104    48gb 66.66.66.666 ip-66.66.66.666-es
    index_name                    4     p      STARTED 46660338  45.7gb 77.77.77.777 ip-77.77.77.777-es
    index_name                    4     r      STARTED 46660338  49.6gb 88.88.88.888 ip-88.88.88.888-es
    index_name                    4     r      STARTED 46660338  46.8gb 99.99.99.999 ip-99.99.99.999-es
    index_name                    0     r      STARTED 46504385    43gb 44.44.44.444 ip-44.44.44.444-es
    index_name                    0     r      STARTED 46504385  53.3gb 22.22.22.222 ip-22.22.22.222-es
    index_name                    0     p      STARTED 46504385    51gb 33.33.33.333 ip-33.33.33.333-es

That doesn't look aligned with what you are suggesting; there are no deleted documents showing.

Sorry, maybe I explained it incorrectly. If I have a document with _id x and an update comes in, es.index is performed with _id x, which replaces the doc x that already exists in the ES database. This happens to around 1mln records per day, out of 3mln records in total.
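To be concrete, each update is just a plain index request reusing the same _id, something like this (the field name here is made up for illustration):

```shell
# Re-indexing with an existing _id replaces the stored document.
# Internally, the old version is only marked as deleted; the new
# version is written to a fresh segment.
curl -X PUT "localhost:9200/index_name/_doc/x" \
  -H 'Content-Type: application/json' \
  -d '{"some_field": "new value"}'
```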

You explained it correctly, but that index is not showing any deleted documents based on the output you provided.

Is that based on the same doc count for the shards? The updates happened in the morning and everything is up to date now across primaries/replicas.

It's based on what I can see from the output from the _cat command you ran. By default, it should show the number of deleted docs directly after the number of docs. There's nothing there though?

What version are you on?

7.3.2

And you ran _cat/indices?v, exactly that?

Sorry, I was looking at the wrong thing - it's getting late now. Here you go:

health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_name 23BMWdfBQKukF5AKjORnkA   6   2  279861774     99922075    833.8gb        268.3gb

Elasticsearch does not perform in-place updates. Instead, data is stored in immutable segments, so updating documents generates new segments that take up additional space, and the old versions of updated documents are not immediately deleted. They are only removed from disk when segments are merged in the background, which is triggered when the proportion of deleted documents in a segment exceeds a threshold. An index growing in size while being updated is therefore expected.
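You can watch this happening per segment with the _cat/segments API; the docs.deleted column there counts old document versions that have been superseded but not yet merged away (substitute your own index name, and note that output columns can vary slightly between versions):

```shell
# Show per-segment live and deleted doc counts plus on-disk size
curl "localhost:9200/_cat/segments/index_name?v&h=index,shard,segment,docs.count,docs.deleted,size"
```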

That makes sense. What should be done in this case? Should I increase the number of shards so that I don't get into a situation where a shard goes over the recommended size? Is there a way to trigger a merge?

You can use the force merge API to trigger merges; it has a parameter named only_expunge_deletes that may help.
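A minimal sketch of that call against your index (force merging is I/O-intensive, so it is best run during a quiet period):

```shell
# Ask Elasticsearch to merge away segments' deleted documents
# without fully merging the index down to one segment
curl -X POST "localhost:9200/index_name/_forcemerge?only_expunge_deletes=true"
```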