I'm not sure this is totally accurate as I'm oversimplifying, but let me tell you a bit about what could happen behind the scenes. Let's say that you have an index with the following terms appearing in the following documents:
azertyuiop: A, B, C, D, E
qsdfghjklm: A, E
wxcvbn: B, C, E
If you have one shard, that single shard will have the following data structure on disk:
azertyuiop: A, B, C, D, E
qsdfghjklm: A, E
wxcvbn: B, C, E
If you have 5 shards and, let's say, A goes to shard 0, B to shard 1, ..., E to shard 4, you will have the following data structure on disk:
Shard 0:
azertyuiop: A
qsdfghjklm: A
Shard 1:
azertyuiop: B
wxcvbn: B
Shard 2:
azertyuiop: C
wxcvbn: C
Shard 3:
azertyuiop: D
Shard 4:
azertyuiop: E
qsdfghjklm: E
wxcvbn: E
As you can see, the same index entries are duplicated multiple times. That explains why the size won't be exactly divided by 5.
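To make the duplication concrete, here is a toy sketch in Python (a plain dict, not Lucene's actual on-disk format) that builds the inverted index above once for a single shard and once per shard, then counts the term entries in each case:

```python
# Toy model of the term dictionary duplication described above: each shard
# keeps its own inverted index, so a term whose documents are routed to
# several shards is stored once per shard.
docs = {
    "A": ["azertyuiop", "qsdfghjklm"],
    "B": ["azertyuiop", "wxcvbn"],
    "C": ["azertyuiop", "wxcvbn"],
    "D": ["azertyuiop"],
    "E": ["azertyuiop", "qsdfghjklm", "wxcvbn"],
}

def build_index(doc_ids):
    """Build an inverted index (term -> list of doc ids) for the given docs."""
    index = {}
    for doc_id in doc_ids:
        for term in docs[doc_id]:
            index.setdefault(term, []).append(doc_id)
    return index

# One shard: each term is stored exactly once.
single_shard = build_index(docs)
print(len(single_shard))  # 3 term entries

# Five shards (A -> 0, B -> 1, ..., E -> 4): the same terms reappear
# in every shard that holds a matching document.
shards = [build_index([doc_id]) for doc_id in docs]
print(sum(len(shard) for shard in shards))  # 10 term entries in total
```

With one shard the term dictionary has 3 entries; split across 5 shards it has 10, which is why the total size is larger than one fifth per shard.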
But there are a few other things to take into account as well. Do you run _forcemerge to reduce the number of segments to 1 before measuring the size? This operation reduces the space used, as each segment contains the list of terms used in that segment; merging down to 1 segment would help. Also, if you are not only appending data but also updating or deleting, using the Force Merge API helps a lot.
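For reference, a minimal sketch of calling the Force Merge API with the Python client; the index name my-index and the local cluster URL are just assumptions for the example:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Merge all segments of the index down to a single segment
# before measuring its size on disk.
es.indices.forcemerge(index="my-index", max_num_segments=1)
```

The equivalent REST call is POST my-index/_forcemerge?max_num_segments=1.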