Elasticsearch Index Data Compression (v1.4.2)


(Sagar) #1

Hello everyone,
I have been using Elasticsearch for storing application logs.

Elasticsearch version: 1.4.2
Log Retention Policy: 30 days

Number of logs generated per month: 250 million
Number of shards per index: 5
Number of replicas for index: 1

The logs/documents in my index are not big, but the number of documents is enormous.

Some Stats of existing data and extrapolation based on the same:
0.7 million - 260 MB
250 Million - 92 GB

92 GB of data per site for just application logs sounds like too much to me.
So I am keen to know whether this index/log data can be compressed, and if so, what performance impact that would have.
My writes to Elasticsearch will be frequent and concurrent, while search requests will be relatively infrequent.

Please advise.

Appreciate it.

Regards,
Sagar Shah


(Christian Dahlqvist) #2

The size your log data takes up on disk depends a lot on the type of data you have and how you map and index it. The Logstash default template indexes all string fields as both analyzed and not_analyzed, which gives a lot of flexibility but can take up a lot of disk space. We published a blog post a while back that looked at how different mappings affect the on-disk size of indexed data for a few sample data types. It shows that even though Elasticsearch already applies compression when it indexes data, the indexed data on disk can still be larger than the raw data, depending on which mappings are used.
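As a sketch of what trimming the mapping can look like: the request below creates an index whose string fields are mapped as not_analyzed only, instead of the analyzed + not_analyzed multi-field that the Logstash template produces. The index name, type name, and field names here are purely illustrative, and note that dropping the analyzed copy also removes full-text search on those fields.

```shell
# Hypothetical example (Elasticsearch 1.x API): map string fields as
# not_analyzed only to save the disk space of the analyzed copy.
# Index/type/field names are illustrative, not from the original thread.
curl -XPUT 'http://localhost:9200/logs-2015.06.01' -d '{
  "mappings": {
    "logs": {
      "properties": {
        "message":  { "type": "string", "index": "not_analyzed" },
        "loglevel": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
```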


(Sagar) #3

Thank you Christian!
That helps :slight_smile:


(Sagar) #4

And if I understand it correctly, the default configuration of Elasticsearch already provides compression support. Is that correct?

Thanks again!


(Christian Dahlqvist) #5

Elasticsearch already compresses data internally by default. The current algorithm balances speed and compression, but the ability to specify more efficient compression is coming in version 2.0.
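For reference, the more efficient compression mentioned above is exposed in Elasticsearch 2.0 as an index codec setting; it is not available in 1.4.x. A minimal sketch (index name illustrative):

```shell
# Sketch (Elasticsearch 2.0+): create an index using the best_compression
# codec (DEFLATE for stored fields) instead of the default LZ4 codec.
# Trades some indexing/retrieval speed for smaller on-disk size.
curl -XPUT 'http://localhost:9200/logs-archive' -d '{
  "settings": {
    "index.codec": "best_compression"
  }
}'
```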


(Sagar) #6

Thank you Christian. Keen to use Elasticsearch v2.0 :slight_smile:


(Sagar) #7

There's one more finding on this.
I had an index on one of our QA boxes (a month-old index) with around 0.7 million records taking about 260 MB.
I created a new index on the same server with the same mapping, pulled all records one by one from the existing index, and pushed them into the new index.
Surprisingly, the new index (a day old), with the same mapping and settings, takes only 136 MB.
What could make such a big difference here?

Please clarify.

Appreciate!


(Christian Dahlqvist) #8

Do you have the same number of shards for both indices? Do you have any deleted documents in the older index, e.g. due to updates?

When comparing the size of indices, I generally optimise them first to ensure they are as compact as possible. Can you try optimising both indices and see if the size difference remains?
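A sketch of that comparison, assuming the Elasticsearch 1.x optimize API and illustrative index names:

```shell
# Force-merge each index down to one segment per shard so that their
# on-disk sizes are directly comparable (1.x _optimize endpoint).
curl -XPOST 'http://localhost:9200/old-index/_optimize?max_num_segments=1'
curl -XPOST 'http://localhost:9200/new-index/_optimize?max_num_segments=1'

# Then compare the sizes, e.g. via the index stats API:
curl 'http://localhost:9200/old-index,new-index/_stats/store'
```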


(Sagar) #9

Thanks Christian for your reply.
Both have the same number of shards (5).
But yes, some documents were deleted from the original index at some point, which was done with the help of the _ttl field.

How can I optimize those indices?

Please advise.

Appreciate it!


(Sagar) #10

I found this article:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-optimize.html

I can optimize this and compare the stats afterwards.
Is it an expensive process? Does it take a long time for a big index?
Does it block my incoming index requests while it's optimizing?


(Harlin) #11

Optimizing an index is extremely expensive, especially if it is optimized all the way down to one segment per shard. Also, never optimize an index that is still receiving index requests; it will cause all kinds of problems. I would only call optimize on an index that is no longer being indexed into, and only when the cluster is not doing too much else.
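Since the original index had documents deleted via _ttl, a lighter option than a full merge is worth noting: the 1.x optimize API also accepts an `only_expunge_deletes` flag, which merges only segments that contain deleted documents rather than collapsing everything to one segment. Index name below is illustrative:

```shell
# Cheaper alternative to a full optimize: reclaim space from deleted
# documents only, without merging down to one segment per shard.
curl -XPOST 'http://localhost:9200/old-index/_optimize?only_expunge_deletes=true'
```

The same caveat applies: run it on an index that is no longer being written to.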

