How is the size of an index related to the number of shards?

Hi there,

I have some questions about the number of shards and the size of indices.

I am currently in a single node configuration.

I have one index for February:

  • index with 1 shard, 477 GB and 600 million documents

To improve query performance, I decided to use more shards for March (so that each shard holds at most 50 GB).

For March, I now have:

  • index with 14 shards, 800 GB and almost 550 million documents

My questions are the following:

  • Why has the store size almost doubled? Is it related to some additional data stored by each shard?
  • Is it a bad idea to add more shards to my index, given that I only have one node?
  • Maybe I misunderstood some elements of the architecture?
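
To put rough numbers on the first question, here is a minimal sketch that compares the two indices per document. It assumes the sizes are decimal gigabytes and takes the March document count as exactly 550 million, although it was only given as an approximation; the function name is made up for illustration.

```python
# Assumption: sizes are decimal gigabytes (10**9 bytes).
GB = 10**9


def bytes_per_doc(store_gb, docs):
    """Average on-disk bytes per document for an index."""
    return store_gb * GB / docs


# Figures quoted in the post; the March count is approximate.
feb = bytes_per_doc(477, 600_000_000)
mar = bytes_per_doc(800, 550_000_000)

print(f"February: {feb:.0f} bytes per document")
print(f"March:    {mar:.0f} bytes per document")
print(f"Growth per document: {mar / feb:.2f}x")

# Average shard size in March, to compare with the 50 GB target.
march_shard_gb = 800 / 14
print(f"Average March shard: {march_shard_gb:.1f} GB")
```

On these figures the per-document footprint grows by a bit more than the raw store size does, because March holds fewer documents than February.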

Thanks!

Best regards,

Hey Vincent

I'm not sure this is totally accurate, as I'm oversimplifying, but let me describe a bit of what could happen behind the scenes. Let's say that you have an index with the following terms appearing in the following documents:

  • azertyuiop: A, B, C, D, E
  • qsdfghjklm: A, E
  • wxcvbn: B, C, E

If you have one shard, it will have the following data structure on disk:

  • azertyuiop: A, B, C, D, E
  • qsdfghjklm: A, E
  • wxcvbn: B, C, E

If you have 5 shards, and let's say that A goes to shard 0, B to shard 1, ... E to shard 4, you will have the following data structure on disk:

Shard 0:

  • azertyuiop: A
  • qsdfghjklm: A

Shard 1:

  • azertyuiop: B
  • wxcvbn: B

Shard 2:

  • azertyuiop: C
  • wxcvbn: C

Shard 3:

  • azertyuiop: D

Shard 4:

  • azertyuiop: E
  • qsdfghjklm: E
  • wxcvbn: E

As you can see, the same index entry is duplicated multiple times across shards. That explains why each shard is not exactly the original size divided by 5.
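
The same toy example can be sketched in a few lines of Python. This is only an illustration of the simplified picture above, not how Lucene actually lays out data on disk; the helper names and the fixed document-to-shard mapping are assumptions taken from the example (real routing is based on a hash).

```python
from collections import defaultdict

# Toy postings from the example: term -> documents containing it.
POSTINGS = {
    "azertyuiop": ["A", "B", "C", "D", "E"],
    "qsdfghjklm": ["A", "E"],
    "wxcvbn": ["B", "C", "E"],
}


def build_shards(postings, doc_to_shard):
    """Split the postings into one term -> documents mapping per shard."""
    shards = defaultdict(lambda: defaultdict(list))
    for term, docs in postings.items():
        for doc in docs:
            shards[doc_to_shard[doc]][term].append(doc)
    return {shard: dict(terms) for shard, terms in sorted(shards.items())}


def term_entries(shards):
    """Count how many term entries are stored across all shards."""
    return sum(len(terms) for terms in shards.values())


# One shard: every document lands in shard 0.
one_shard = build_shards(POSTINGS, {doc: 0 for doc in "ABCDE"})
# Five shards: A -> 0, B -> 1, ... E -> 4, as in the example.
five_shards = build_shards(POSTINGS, {doc: i for i, doc in enumerate("ABCDE")})

print(term_entries(one_shard))    # 3 term entries
print(term_entries(five_shards))  # 10 term entries
```

The number of document references is the same in both layouts (10), but the terms themselves are stored 3 times with one shard and 10 times with five shards.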

But there are a few other things to take into account as well. Did you run _forcemerge to reduce the number of segments to 1 before measuring the size? This operation reduces the space used, because each segment contains its own list of the terms used in that segment. Reducing to 1 segment would help. Also, if you are not only appending data but also updating or deleting documents, the Force Merge API helps a lot, since deleted documents are only marked as deleted until their segments are merged.
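
As a rough sketch, the force merge call could be prepared like this from Python using only the standard library. The host `http://localhost:9200` and the index name `my-index-march` are assumptions for illustration, as are the helper names; the snippet only builds the requests and does not send them.

```python
import urllib.request

# Assumption: a local single-node cluster on the default port.
DEFAULT_HOST = "http://localhost:9200"


def force_merge_request(index, host=DEFAULT_HOST, max_num_segments=1):
    """Build a POST request that merges each shard down to N segments."""
    url = f"{host}/{index}/_forcemerge?max_num_segments={max_num_segments}"
    return urllib.request.Request(url, method="POST")


def segments_request(index, host=DEFAULT_HOST):
    """Build a GET request listing the segments of an index."""
    url = f"{host}/_cat/segments/{index}?v"
    return urllib.request.Request(url, method="GET")


# "my-index-march" is a hypothetical index name.
merge = force_merge_request("my-index-march")
listing = segments_request("my-index-march")

print(merge.get_method(), merge.full_url)
print(listing.get_method(), listing.full_url)

# To actually send a request against a running cluster:
# with urllib.request.urlopen(merge) as response:
#     print(response.read().decode())
```

Force merging is best kept for indices that are no longer being written to, such as a finished monthly index, and the store size should be compared only after the merge has completed.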

Thanks a lot @dadoonet for your explanations and your advice to use the _forcemerge API!