Size of Index

I ran a simple test to understand index size, based on what I observed.

Case 1: Total indexed volume: 15 million documents (74 GB raw). Index size: 38.1 GB.
Case 2: Total indexed volume: 500K documents (3 GB raw). Index size: 18 GB.

Case 1 shows great compression, whereas Case 2 goes the opposite way. The 500K set is a subset of the 15 million. If my understanding is correct, this is because of repetitive terms coming from the analyzed field.

Question 1: Can someone clarify this, please?
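For reference, a quick sketch of how the two ratios above work out (figures taken directly from the two cases):

```python
# Rough index-to-raw-size ratios for the two cases above.
case1_raw_gb, case1_index_gb = 74.0, 38.1
case2_raw_gb, case2_index_gb = 3.0, 18.0

ratio1 = case1_index_gb / case1_raw_gb  # ~0.51: index is about half the raw size
ratio2 = case2_index_gb / case2_raw_gb  # 6.0: index is six times the raw size

print(f"Case 1: {ratio1:.2f}x raw size")
print(f"Case 2: {ratio2:.2f}x raw size")
```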

On another note, I took a single document (4 KB raw) and created three versions of an index (0 replicas, 1 shard) from it:
v1 - Analyzed on single attribute
v2 - Analyzed on single attribute, but _all is set to false
v3 - No attribute is analyzed

After indexing the content, below is the output I saw:

get _cat/shards/v1,v2,v3?v

index shard prirep state   docs store  ip node
v1    0     p      STARTED 5    18.8kb    Wildboys
v2    0     p      STARTED 5    18.8kb    Wildboys
v3    0     p      STARTED 5    19kb      Wildboys

It would be helpful if someone could clarify the queries below.

Question 2: How is it that the index size is so much greater than the original text? I tried running /v1/_analyze on the analyzed content and it translates to 18 terms.

Question 3: Why is docs 5? get _cat/indices/v1,v2,v3?v also reports a document count of 5, though there is only one document. get /v1/_count correctly returns 1.

Question 4: What is the recommended shard size, and how many shards can we have? Is <=50 GB reasonable on a 14 GB RAM machine? Is there any logic for computing this?
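As a back-of-envelope sketch only (assuming Case 1's observed ratio of ~2.5 KB of index per document holds, with the 20 million docs/day goal over roughly 6 months of retention; real sizing depends on mappings, merges, and replicas):

```python
# Back-of-envelope estimate of total index size and shard count.
docs_per_day = 20_000_000
retention_days = 180                          # ~6 months
index_bytes_per_doc = 38.1e9 / 15_000_000     # ~2.5 KB/doc, from Case 1

total_index_bytes = docs_per_day * retention_days * index_bytes_per_doc
total_index_tb = total_index_bytes / 1e12

max_shard_bytes = 50e9                        # the <=50 GB figure from the question
shards_needed = total_index_bytes / max_shard_bytes

print(f"~{total_index_tb:.1f} TB of index, ~{shards_needed:.0f} shards of 50 GB")
```

That comes out to roughly 9 TB of primary index data, which is far beyond what a single 14 GB RAM machine can keep hot, so this is really a cluster-sizing exercise.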

Question 5: Are there any specific options to reduce index size, other than the ones below?
_source=false, which I can't use, as I'm not storing fields individually and would like to avoid doing so
_all=false. I have only one analyzed field; all the rest are not_analyzed.
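For illustration, a minimal ES 2.x mapping sketch along those lines, using a few of the fields from the sample document (_all disabled, one analyzed string field, the rest not_analyzed; this is an assumption of the intended mapping, not the one actually attached):

```
PUT /test
{
  "mappings": {
    "en": {
      "_all": { "enabled": false },
      "properties": {
        "Content":      { "type": "string" },
        "From":         { "type": "string", "index": "not_analyzed" },
        "SourceDomain": { "type": "string", "index": "not_analyzed" },
        "date":         { "type": "date" }
      }
    }
  }
}
```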

NOTE: I referred to the URLs below for validating various items.

My goal is to reach 20 million documents/day and keep them for at least 6-7 months (all hot, searchable, and aggregatable).

That blog post is pretty old! While some of it is still relevant, be aware that things change over time.


Depends on what the document looks like.

Count includes deleted docs, it could be that.

That's a larger question, not directly answerable by providing a number of shards.
But you should set up a test that creates a number of indices on the node and see what it can cope with.

Thanks Mark. But for Q3, I didn't delete any documents. I just inserted (POSTed) one document and took the metrics.

Then you must have indexed more?

For Q3, it would be better to post your complete repro steps (with curl commands); this can help others better understand your scenario and identify the root cause more easily.

Below is the sequence of commands I used.

  1. Create Index with mappings attached

  2. Check for document counts
    get _cat/indices/test?v
    get _cat/shards/test?v
    get /test/_count

  3. Add one single document using POST
    POST /test/en/1207407677
    {"DId":"38383838383383838","date":"2015-12-06T07:27:23","From":"TWITTER","Title":"","Link":"","SourceDomain":"","Content":"@sadfasdfasf Join us for the event on ABC tech and explore more https:\/\/\/SDDJDJD via https:\/\/\/RUXLEISC","FriendsCount":20543,"FollowersCount":34583,"Score":null}

  4. Check the count
    get _cat/indices/test?v
    get _cat/shards/test?v
    get /test/_count

ES Version 2.2.0

Attached where? I think you may have missed this :slight_smile:

Hi Mark. I created the mappings to match the POST, but we can reproduce it without a mapping as well :-). I should have removed that step (1).