I'm running a simple test to understand index size, based on what I have observed.
Case 1: Total indexed volume: 15 million documents (74 GB raw). Index size: 38.1 GB.
Case 2: Total indexed volume: 500K documents (3 GB raw). Index size: 18 GB.
Case 1 shows great compression, whereas Case 2 goes the opposite way. The 500K documents are a subset of the 15 million. If my understanding is correct, this is because of the repetitive terms that come from the analyzed field.
Question 1: Can someone clarify whether this explanation is correct?
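(For reference, I'm reading the index sizes above from the cat API, with a call roughly like the one below; the index names here are just placeholders for my actual indices.)

get _cat/indices/case1_index,case2_index?v&h=index,docs.count,store.size,pri.store.size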
Also, on another note: I took a single document (4 KB raw) and created 3 versions of an index (0 replicas, 1 shard) based on that same document (a rough sketch of what I mean by these mappings follows the list):
v1 - a single attribute is analyzed
v2 - a single attribute is analyzed, but _all is set to false
v3 - no attribute is analyzed
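A minimal sketch of what I mean by the v2 mapping, assuming 2.x-style string mappings; the type name "doc" and the field names "message" and "status" are placeholders for my actual ones:

PUT /v2
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
  "mappings": {
    "doc": {
      "_all": { "enabled": false },
      "properties": {
        "message": { "type": "string", "index": "analyzed" },
        "status":  { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

v1 is the same without the _all setting, and v3 has "index": "not_analyzed" on every string field.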
When I indexed the content, this is the output I saw:
get _cat/shards/v1,v2,v3?v
index shard prirep state   docs store  ip        node
v1    0     p      STARTED 5    18.8kb 127.0.0.1 Wildboys
v2    0     p      STARTED 5    18.8kb 127.0.0.1 Wildboys
v3    0     p      STARTED 5    19kb   127.0.0.1 Wildboys
It would be helpful if someone could clarify the queries below.
Question 2: How is it that the index size is so much greater than the original text? I tried /v1/_analyze... on the analyzed content and it translates to 18 terms.
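(Roughly the kind of call I mean; the field name "message" is a placeholder for my analyzed attribute and the text is just a sample:

get /v1/_analyze?field=message&text=some sample text from the analyzed attribute
)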
Question 3: Why is docs reported as 5? get _cat/indices/v1,v2,v3?v also reports 5 as the document count, though there is only one document. get /v1/_count correctly returns 1.
Question 4: What is a recommended shard size, and how many shards could we have? Is <= 50 GB per shard reasonable on a 14 GB RAM machine? Is there any logic for computing this?
Question 5: Are there any specific options to reduce index size other than the two below? (A sketch of what I mean by these settings follows.)
_source=false, which I cannot use, as I'm not storing fields individually and would like to avoid that.
_all=false. I have only one analyzed field; everything else is not_analyzed.
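For clarity, this is roughly the mapping-level form of those two options (2.x-style syntax; the index and type names are placeholders):

PUT /some_index
{
  "mappings": {
    "doc": {
      "_source": { "enabled": false },
      "_all":    { "enabled": false }
    }
  }
}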
NOTE: I referred to the URLs below while validating the various items.
My goal is to get to 20 million documents/day and keep them for at least 6-7 months (all hot, and searchable/aggregatable).
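For scale, my back-of-the-envelope math, assuming documents stay around the 4 KB raw size mentioned above (which may not hold):

20,000,000 docs/day x 4 KB ≈ 80 GB/day of raw data
180-210 days (6-7 months) x 80 GB/day ≈ 14-17 TB raw, before any index overhead or compression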