How to optimize Elasticsearch index data space


(Roopendra) #1

I am parsing Apache access logs with Logstash and indexing them into an Elasticsearch index. I am also indexing the geoip and agent fields. While indexing, I observed that the Elasticsearch index is 6.7x bigger than the actual file size (space on disk). Is this the expected behavior, or am I doing something wrong? I am using Elasticsearch 5.0, Logstash 5.0 and Kibana 5.0.

My Observations:

Use Case 1:

Apache Log file Size : 211 MB
Total number of lines: 1,000,000
Index Size: 1.5 GB
Observation: Index is 6.7x bigger than the file size.

Use Case 2:

I found a few suggestions for reducing the Elasticsearch index size and tried them as well:

  • Disable the _all field
  • Remove unwanted fields created by the "geoip" and "agent" parsing
  • Enable best_compression ["index.codec": "best_compression"]

Apache Log file Size : 211 MB
Total number of lines: 1,000,000
Index Size: 1.3 GB
Observation: Index is 6.16x bigger than the file size
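
For reference, the _all and best_compression changes listed above could be expressed in an Elasticsearch 5.0 index template roughly as follows. This is only a sketch: the index pattern `apache-logs-*` is a hypothetical name, not one taken from this thread, and removing the unwanted geoip/agent fields would happen in the Logstash pipeline rather than in the template.

```python
import json

# Hypothetical index pattern (an assumption for illustration). It has to match
# the index name that the Logstash elasticsearch output actually writes to.
INDEX_PATTERN = "apache-logs-*"

# Elasticsearch 5.0 style index template as a plain Python dict.
template = {
    "template": INDEX_PATTERN,
    "settings": {
        # Trade some CPU for better compression of stored fields.
        "index.codec": "best_compression",
    },
    "mappings": {
        "_default_": {
            # Disable the catch-all _all field to avoid indexing every value twice.
            "_all": {"enabled": False},
        }
    },
}

# JSON body that would be sent when registering the template.
template_json = json.dumps(template, indent=2)
print(template_json)
```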

Log File Format:

219.161.55.250 - - [24/Nov/2016:02:03:08 -0800] "GET /wp-admin HTTP/1.0" 200 4916 "http://trujillo-carpenter.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.01; Trident/5.1)"

I found Logstash + Elasticsearch Storage Experiments, where they say they reduced the index size from 6.23x to 1.57x. But those are pretty old solutions, and they no longer work in Elasticsearch 5.0.

Some more references I have already tried:

Part 2.0: The true story behind Elasticsearch storage requirements

https://github.com/elastic/elk-index-size-tests

Is there any better way to optimize the Elasticsearch index size when the only purpose is to show visualizations in Kibana?


(Christian Dahlqvist) #2

I was not able to access the gists you linked to. Can you check the links? How many shards are you indexing into? Did you do a _forcemerge down to 1 segment once indexing completed?

When I perform this kind of calculation, I often split it up into two factors:

  1. Size of generated JSON documents / Raw size
  2. Indexed size on disk / Size of generated JSON documents

The level of enrichment performed will affect the first one and the mappings the second.
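
The two factors multiply to give the overall expansion ratio. A small sketch of that arithmetic, using the 211 MB raw size and 1.3 GB index size from the first post; the size of the generated JSON documents was not measured in this thread, so the 650 MB figure below is purely an assumed placeholder.

```python
def expansion_factors(raw_bytes, json_bytes, index_bytes):
    """Split the overall size ratio into enrichment and indexing factors."""
    enrichment = json_bytes / raw_bytes    # factor 1: JSON size / raw size
    indexing = index_bytes / json_bytes    # factor 2: size on disk / JSON size
    return enrichment, indexing, enrichment * indexing


MB = 1000 * 1000
raw_size = 211 * MB      # raw Apache log size, from the thread
index_size = 1300 * MB   # 1.3 GB index size, from the thread
json_size = 650 * MB     # ASSUMED value for illustration only, not measured

enrichment, indexing, overall = expansion_factors(raw_size, json_size, index_size)
print(f"enrichment factor: {enrichment:.2f}")
print(f"indexing factor:   {indexing:.2f}")
print(f"overall ratio:     {overall:.2f}")
```

Measuring the real JSON size (for example by writing the Logstash output to a file) would show which of the two factors dominates.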


(Roopendra) #3

Thanks for your reply. I am attaching the configuration and template files from pastebin.

Observation 1:

Logstash Conf: http://pastebin.com/yexJT5Sb
Template File: http://pastebin.com/upDV2s2Z

Observation 2:

Logstash Conf: http://pastebin.com/H897rd4G
Template File: http://pastebin.com/1JSaJwyL

I am using 5 shards and 0 replicas. I haven't tried _forcemerge yet, but I did run _flush after indexing the data.


(Christian Dahlqvist) #4

I looked at observation 2 and noticed the following things:

  1. The template uses Elasticsearch 2.x mappings, which have changed in 5.0.
  2. Even if the template was adjusted for ES 5.0 mappings, it will not be applied to the created index as the template would not be applicable to the index name used, only an index named elk_workshop. Retrieve the mappings for the index to see what is actually in effect.
  3. Shard and segment size can have a significant impact on the indexing overhead, as compression improves with increased data volumes. Given that you index 1000000 records, I would set this up to use a single shard and _forcemerge it down to 1 segment as this can make a big difference.
  4. Whenever I keep the @message field, I often set it to not indexed, as I rarely or never query on it.
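
The points above could be sketched as follows. This only builds the HTTP requests and settings without sending anything; the host `http://localhost:9200`, the index name `apache-logs-poc`, and the field name `message` are assumptions for illustration and are not taken from the configuration in this thread.

```python
import urllib.request

ES_URL = "http://localhost:9200"   # assumed local Elasticsearch node
INDEX = "apache-logs-poc"          # hypothetical index name


def get_mapping_request(index):
    """Request that retrieves the mappings actually in effect for an index."""
    return urllib.request.Request(f"{ES_URL}/{index}/_mapping", method="GET")


def forcemerge_request(index):
    """Request that merges the index down to a single segment per shard."""
    url = f"{ES_URL}/{index}/_forcemerge?max_num_segments=1"
    return urllib.request.Request(url, method="POST")


# Settings fragment for a small index: one primary shard, no replicas.
single_shard_settings = {"number_of_shards": 1, "number_of_replicas": 0}

# Mapping fragment (Elasticsearch 5.0 syntax) that keeps the raw message in
# _source but does not index it. The field name is an assumption.
message_mapping = {"message": {"type": "text", "index": False}}

for request in (get_mapping_request(INDEX), forcemerge_request(INDEX)):
    # To actually send a request: urllib.request.urlopen(request)
    print(request.get_method(), request.full_url)
```

Running the force merge only after indexing has finished is the usual order, since new writes create new segments again.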

(Roopendra) #5

Thanks for your suggestion.

  1. I have also tried the template below, which is compatible with Elasticsearch 5.0: https://github.com/elastic/examples/blob/master/ElasticStack_apache/apache_template.json
  2. Do you mean the template is not being applied properly? It would be very helpful if you could share a correct template.
  3. I am doing this just as a POC. In the actual implementation the data size may be 100 million records, so limiting it to 1 shard may not help here. I may need your expert suggestion.
  4. I will definitely try it.


(Christian Dahlqvist) #6

The template parameter contains the pattern that will determine whether it is applied to a new index or not, see the documentation on index templates. The template you have specified here will therefore only apply to an index named apache_elastic_example.
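
To illustrate why a template can silently fail to apply, here is a rough approximation of the matching rule: the pattern is compared against the whole index name, with `*` as a wildcard. This is a simplified sketch and not Elasticsearch's own implementation; the index name `logstash-2016.11.24` is a hypothetical example.

```python
import re


def template_applies(template_pattern, index_name):
    """Approximate check: does an index name match a template pattern?

    Only the '*' wildcard is handled; everything else is matched literally.
    """
    regex = ".*".join(re.escape(part) for part in template_pattern.split("*"))
    return re.fullmatch(regex, index_name) is not None


# A pattern without wildcards only matches an index with exactly that name.
print(template_applies("apache_elastic_example", "apache_elastic_example"))
print(template_applies("apache_elastic_example", "logstash-2016.11.24"))

# A wildcard pattern covers every index name that fits it.
print(template_applies("logstash-*", "logstash-2016.11.24"))
```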

The size of shards and segments matters. With only 1000000 documents, I would recommend using a single shard to ensure it gets up to a decent size. When you have more data, you would naturally have more than 1 shard, but you should try to keep the average shard size in the GB range.
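
As a quick illustration of the shard-size point, using the 1.3 GB index reported earlier in the thread (taken as 1300 MB, and assuming the data spreads evenly across shards):

```python
def average_shard_size_mb(index_size_mb, primary_shards):
    """Average size of one primary shard, assuming an even spread of data."""
    return index_size_mb / primary_shards


print(average_shard_size_mb(1300, 5))  # 260.0 (well below the GB range)
print(average_shard_size_mb(1300, 1))  # 1300.0 (a single shard in the GB range)
```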


(Roopendra) #7

Thanks @Christian_Dahlqvist. The template was not being applied to the index, so compression was not being applied either. I also tried with 1 shard, and it significantly reduces the size of the index.

