Shards and Documents

Hi Community,

My Linux box has 16GB of RAM and a 500GB hard disk.

My questions are:

  1. What is the maximum number of shards I can allocate for my index?
  2. How much data can a shard accommodate?

According to the documentation, the default number of shards is 5, and a shard can accommodate 2,147,483,519 documents. (Neither the document size nor the maximum number of shards per index is specified.)

What you read is correct: a shard can hold roughly 2B documents. That figure assumes the indexed documents contain no nested object types, because each nested object counts as a separate document toward the limit.

Regarding the number of shards per index: yes, unless you tell ES otherwise, every new index is created with 5 shards and 1 replica. To change these numbers, you can either change the ES configuration or use an index template.
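As a rough illustration of overriding those defaults, here is a minimal Python sketch that only builds the JSON body you would send when creating an index. The `number_of_shards` and `number_of_replicas` settings are the real ES index settings; the helper name `index_settings_body` and the values chosen are assumptions for illustration, and no request is actually sent.

```python
import json


def index_settings_body(shards: int, replicas: int) -> dict:
    """Build a create-index request body with explicit shard and replica counts.

    Hypothetical helper for illustration; it only assembles the JSON body.
    """
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


# Assumed single-node setup: one primary shard, no replicas.
body = index_settings_body(shards=1, replicas=0)
print(json.dumps(body, indent=2))
```

The printed JSON is what you would pass as the body of the create-index request (or place under the settings of a template) so that the index does not pick up the 5-shard default.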

Regarding the maximum number of shards that you can allocate for your index, you'll need to do your own estimate:

  • Say you plan to index 5K documents. Since each shard can hold ~2B documents, in theory one index with 1 shard is enough.

  • Say you plan to index 12B documents. In theory you can go with 6 shards, but that puts ~2B documents in each shard, right at the limit. In practice you should use more shards to keep each one comfortably below 2B documents.
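The two estimates above can be sketched as a small calculation. This is only an illustration of the arithmetic: the per-shard limit is the figure quoted in this thread, while the function name `min_shards` and the `headroom` parameter (the fraction of the limit you are willing to fill) are assumptions introduced here.

```python
MAX_DOCS_PER_SHARD = 2_147_483_519  # per-shard document limit quoted in this thread


def min_shards(total_docs: int, headroom: float = 1.0) -> int:
    """Smallest shard count that keeps every shard under the document limit.

    headroom is a hypothetical knob: 1.0 fills shards up to the hard limit,
    0.5 plans for shards that are at most half full.
    """
    if total_docs < 0:
        raise ValueError("total_docs cannot be negative")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable = int(MAX_DOCS_PER_SHARD * headroom)
    # Integer ceiling division; an index always has at least one shard.
    return max(1, (total_docs + usable - 1) // usable)


print(min_shards(5_000))                         # 1
print(min_shards(12_000_000_000))                # 6
print(min_shards(12_000_000_000, headroom=0.5))  # 12
```

The document limit only gives a lower bound on the shard count. Shard size on disk and query load usually matter more when picking the real number.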

Lastly, with a 500GB hard disk, how many documents you can index depends on the size of the documents and how you want ES to handle your data. For every field in a document, you can tell ES to index and store the data, or to index it without storing it. Storing data increases the index size. You'll need to index your data to find out realistically how many documents your Linux box can hold.


I have 100GB of incoming data per day, with 5 concurrent users running simple search queries.

We can allocate 3 boxes as Elasticsearch nodes.
How much CPU and RAM must those boxes have to handle search and indexing efficiently?
A rough estimate, with figures, of the compute required for a sample environment would be a great help.

I suggest you index this 100GB of data to see the actual index size in your environment.

In your first post, you said the Linux box has a 500GB disk, and here you say you have 100GB of data coming in per day. If you take my suggestion above, you'll find out the index size. That will tell you how much data your Linux box, or your 3-node cluster, can really hold.
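Once you have measured the index size, the capacity question becomes simple arithmetic. The sketch below is only an illustration: `index_to_raw_ratio` is a placeholder for the ratio you measure (index size on disk divided by raw data size), and the function name, the `usable_fraction` value, and the assumption that each of the 3 nodes has a 500GB disk are not from this thread.

```python
def days_of_retention(
    disk_gb_per_node: float,
    nodes: int,
    daily_raw_gb: float,
    index_to_raw_ratio: float,
    replicas: int = 0,
    usable_fraction: float = 0.85,
) -> float:
    """Estimate how many days of data fit on the cluster's disks.

    index_to_raw_ratio must come from your own test indexing; usable_fraction
    is an assumed safety margin so the disks are never planned to be full.
    """
    if min(disk_gb_per_node, daily_raw_gb, index_to_raw_ratio) <= 0 or nodes < 1:
        raise ValueError("sizes and ratio must be positive, nodes at least 1")
    # Every replica is a full extra copy of the primary data.
    daily_on_disk_gb = daily_raw_gb * index_to_raw_ratio * (1 + replicas)
    usable_gb = disk_gb_per_node * nodes * usable_fraction
    return usable_gb / daily_on_disk_gb


# Placeholder ratio of 1.0 (index as large as the raw data); replace with a measured value.
single_box = days_of_retention(500, 1, 100, index_to_raw_ratio=1.0)
three_nodes = days_of_retention(500, 3, 100, index_to_raw_ratio=1.0, replicas=1)
print(f"single box: {single_box:.1f} days, 3 nodes with 1 replica: {three_nodes:.1f} days")
```

With the placeholder ratio, both setups hold less than a week of data at 100GB per day, which is why measuring the real index size comes first.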

Since I don't know what your data looks like, I suggest sticking with the default settings and going from there. That gives you a baseline, and any tuning down the road can be validated against it to see whether it made things better or worse.
