I'm running a simple test to understand index size, based on what I have observed.
Case 1: Total indexed volume: 15 million documents (74 GB raw). Index size: 38.1 GB.
Case 2: Total indexed volume: 500K documents (3 GB raw). Index size: 18 GB.
Case 1 shows great compression, whereas Case 2 goes the opposite way. The 500K documents are a subset of the 15 million. If my understanding is correct, this is because of the repetitive terms that come from the analyzed field.
Question 1: Can someone clarify whether this explanation is correct?
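(For reference, I'm reading the index sizes above from the cat API, with a call roughly like the one below; the index names here are just placeholders for my actual indices.)

get _cat/indices/case1_index,case2_index?v&h=index,docs.count,store.size,pri.store.size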
Also, on another note: I took a single document (4 KB raw) and created 3 versions of an index (0 replicas, 1 shard) based on that same document (a rough sketch of what I mean by these mappings follows the list):
v1 - a single attribute is analyzed
v2 - a single attribute is analyzed, but _all is set to false
v3 - no attribute is analyzed
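A minimal sketch of what I mean by the v2 mapping, assuming 2.x-style string mappings; the type name "doc" and the field names "message" and "status" are placeholders for my actual ones:

PUT /v2
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
  "mappings": {
    "doc": {
      "_all": { "enabled": false },
      "properties": {
        "message": { "type": "string", "index": "analyzed" },
        "status":  { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

v1 is the same without the _all setting, and v3 has "index": "not_analyzed" on every string field.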
When I indexed the content, this is the output I saw:
get _cat/shards/v1,v2,v3?v
index shard prirep state   docs store  ip        node
v1    0     p      STARTED 5    18.8kb 127.0.0.1 Wildboys
v2    0     p      STARTED 5    18.8kb 127.0.0.1 Wildboys
v3    0     p      STARTED 5    19kb   127.0.0.1 Wildboys
It would be helpful if someone could clarify the queries below.
Question 2: How is it that the index size is so much greater than the original text? I tried /v1/_analyze... on the analyzed content and it translates to 18 terms.
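(Roughly the kind of call I mean; the field name "message" is a placeholder for my analyzed attribute and the text is just a sample:

get /v1/_analyze?field=message&text=some sample text from the analyzed attribute
)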
Question 3: Why is docs reported as 5? get _cat/indices/v1,v2,v3?v also reports 5 as the document count, though there is only one document. get /v1/_count correctly returns 1.
Question 4: What is a recommended shard size, and how many shards could we have? Is <= 50 GB per shard reasonable on a 14 GB RAM machine? Is there any logic for computing this?
Question 5: Are there any specific options to reduce index size other than the two below? (A sketch of what I mean by these settings follows.)
_source=false, which I cannot use, as I'm not storing fields individually and would like to avoid that.
_all=false. I have only one analyzed field; everything else is not_analyzed.
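For clarity, this is roughly the mapping-level form of those two options (2.x-style syntax; the index and type names are placeholders):

PUT /some_index
{
  "mappings": {
    "doc": {
      "_source": { "enabled": false },
      "_all":    { "enabled": false }
    }
  }
}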
NOTE: I referred to the URLs below while validating the various items.
My goal is to get to 20 million documents/day and keep them for at least 6-7 months (all hot, and searchable/aggregatable).
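For scale, my back-of-the-envelope math, assuming documents stay around the 4 KB raw size mentioned above (which may not hold):

20,000,000 docs/day x 4 KB ≈ 80 GB/day of raw data
180-210 days (6-7 months) x 80 GB/day ≈ 14-17 TB raw, before any index overhead or compression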