Index size seems massive relative to the data being sent

Hello,

I am sending metrics via collectd at a 10s interval. My daily index comes out to 25-27 GB, which seems a bit massive. The index settings are "number_of_shards": "5" and "number_of_replicas": "1". Each message carries 12 key-value pairs in total. Is there something else that could be making this index so large?

Hey Kenneth,

Elasticsearch does a number of things to make your data more easily searchable, all of which add some overhead to your documents and therefore to the index size. On top of that, you also have 1 replica, so the data volume is doubled immediately.

Analysed fields (text fields in ES 5.x and later) carry some overhead, since each field is split into individual terms. If you have any string fields that don't require full-text search (i.e. you know the exact value you want to search for), consider setting those fields to not_analyzed, or to the keyword type in ES 5.x and later.
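As a rough illustration of the keyword suggestion above, here is a minimal Python sketch that builds the `properties` part of a mapping. The field names are hypothetical (the thread never lists the actual fields), and where this fragment sits in the full mapping body depends on your ES version. On 2.x the equivalent per-field setting would be `"type": "string"` with `"index": "not_analyzed"`.

```python
import json


def keyword_properties(field_names):
    """Build a mapping 'properties' fragment that marks each field as keyword."""
    return {"properties": {name: {"type": "keyword"} for name in field_names}}


# Hypothetical field names, for illustration only.
mapping_fragment = keyword_properties(["host", "plugin", "type_instance"])
print(json.dumps(mapping_fragment, indent=2))
```

Numeric metric values are not analysed in the first place, so this only matters for the string fields in each message.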

You also have two additional special fields that will cause bloat in the size of your documents:

_source -- The original raw JSON in its entirety. It can be disabled, but there are some warnings/considerations to review first: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

_all (deprecated in 6.x) -- A concatenation of all the values of your fields, for when you want to search for a specific value but don't care which field it is in. Off the top of my head, I'm not sure whether this is enabled by default or not.

All of the above adds storage overhead, easily resulting in an indexed document that is larger than the original, up to 2x, 3x, etc. You can tweak those settings in your index (and also not index any fields you don't care about), which could help save some space.

You can also apply a different codec to your index, i.e. best_compression, which will help keep storage usage down but will have some CPU overhead: https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch
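For reference, a minimal sketch of what an index-creation body with that codec might look like, written as a Python dict so it can be dumped to JSON. The shard and replica counts are the ones from the original post. How you send the body (an index template for the daily indices, a client library, curl) is left out and is up to your setup.

```python
import json

# Settings body for a new index; "codec" is the only addition to
# the settings quoted in the original post.
index_body = {
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "codec": "best_compression",
        }
    }
}

print(json.dumps(index_body, indent=2))
```

Note that index.codec is a static setting, so it has to be set when the index is created (or on a closed index), and it only affects segments written after the change.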

Hope that helps

Cheers,
Mike

I will give this a try. Thank you @Evesy

@Evesy we tried disabling the _source field, but that seems to break a lot of stuff, so it is not an option for us. I am still curious as to how this data set is so large. We did a trial run writing straight to disk, and it comes out to about 0.5 GB per day, but ES seems to make it 10-20x that. Is there something I am missing in what is being done within ES? Thanks for any help!
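One rough way to put numbers on this comparison is sketched below in Python. It assumes the 25-27 GB figure from the first post includes the replica copy, which the thread does not actually state, and the function name is just for illustration.

```python
def expansion_ratio(index_gb, raw_gb, replicas=0):
    """Return the primary-only index size divided by the raw on-disk size."""
    primary_gb = index_gb / (1 + replicas)
    return primary_gb / raw_gb


RAW_GB_PER_DAY = 0.5  # from the straight-to-disk trial run

for index_gb in (25.0, 27.0):
    ratio = expansion_ratio(index_gb, RAW_GB_PER_DAY, replicas=1)
    print(f"{index_gb} GB index -> about {ratio:.0f}x raw size on primaries")
```

If the reported size is for primaries only, pass replicas=0 and the ratio doubles.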

I'm sure someone with more knowledge can weigh in with more details around what causes inflated disk usage in Elastic compared to raw data storage.

I've found https://qbox.io/blog/elasticsearch-5-0-disk-usage-tuning to be quite useful in reducing storage use, though. There are quite a few options in there that are a lot less 'breaking' than disabling the _source field.
