I am parsing Apache access logs with Logstash and indexing them into an Elasticsearch index. I am also indexing geoip and agent fields. While indexing I observed that the Elasticsearch index size (space on disk) is over 6x bigger than the actual file size. So I just want to understand: is this the correct behavior, or am I doing something wrong here? I am using Elasticsearch 5.0, Logstash 5.0 and Kibana 5.0.
Apache log file size: 211 MB
Total number of lines: 1,000,000
Index size: 1.3 GB
Observation: the index is 6.16x bigger than the file size
Log File Format:
219.161.55.250 - - [24/Nov/2016:02:03:08 -0800] "GET /wp-admin HTTP/1.0" 200 4916 "http://trujillo-carpenter.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.01; Trident/5.1)"
I found the Logstash + Elasticsearch Storage Experiments post, where they reduced the index size from 6.23x to 1.57x the raw size. But those are pretty old solutions, and they no longer work in Elasticsearch 5.0.
I was not able to access the gists you linked to. Can you check the links? How many shards are you indexing into? Did you do a _forcemerge down to 1 segment once indexing completed?
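If you have not, a force merge down to a single segment can be issued like this (the index name here is just a placeholder for whatever index you are writing to):

```
POST /apache-logs-2016.11.24/_forcemerge?max_num_segments=1
```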
When I perform these kinds of calculations, I often split it up into two factors:
Size of generated JSON documents / Raw size
Indexed size on disk / Size of generated JSON documents
The level of enrichment performed will affect the first factor, and the mappings the second.
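As a purely hypothetical illustration with made-up numbers: if the 211 MB of raw logs expand to roughly 630 MB of enriched JSON (timestamp, geoip and agent fields, etc.), the first factor is about 3.0x; going from 630 MB of JSON to 1.3 GB on disk is then a second factor of about 2.05x, and 3.0 × 2.05 ≈ 6.16, which matches the overall ratio observed.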
I looked at observation 2 and noticed the following things:
1) The template uses Elasticsearch 2.x mappings, and the mapping syntax has changed in 5.0 (a minimal 5.0-style sketch follows this list).
2) Even if the template were adjusted for ES 5.0 mappings, it would not be applied to the index you created, as the template pattern only matches an index named elk_workshop. Retrieve the mappings for the index to see what is actually in effect.
3) Shard and segment size can have a significant impact on the indexing overhead, as compression improves with increased data volumes. Given that you are indexing 1,000,000 records, I would set this up to use a single shard and _forcemerge it down to 1 segment, as this can make a big difference.
4) Whenever I keep the @message field, I often set it to not be indexed, as I rarely or never query on it.
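To illustrate points 1, 2 and 4, here is a minimal sketch of what a 5.0-compatible template could look like. The template name, index pattern, document type and field names below are invented for the example, not taken from the workshop; in 5.0, string fields become keyword or text, and "index": false replaces the old not_analyzed/no settings:

```
PUT _template/apache_logs_template
{
  "template": "apache-logs-*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "logs": {
      "properties": {
        "clientip": { "type": "keyword" },
        "verb": { "type": "keyword" },
        "response": { "type": "integer" },
        "@message": { "type": "text", "index": false }
      }
    }
  }
}

# Verify what actually took effect on a concrete index (name is a placeholder)
GET apache-logs-2016.11.24/_mapping
```

A field with "index": false is still returned as part of _source; it just cannot be searched, which is what saves the indexing overhead for a large field like the full log line.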
I have also tried the below template, which is compatible with Elasticsearch 5.0:
https://github.com/elastic/examples/blob/master/ElasticStack_apache/apache_template.json
2) Do you mean the template is not being applied properly? It would be very helpful if you could share a correct template.
3) I am doing this just as a POC. In the actual implementation the data size may be 100 million records, so limiting it to 1 shard may not help there. Maybe I need your expert suggestion here.
4) I will definitely try it.
The template parameter contains the pattern that determines whether the template is applied to a new index or not; see the documentation on index templates. The template you have specified here will therefore only apply to an index named apache_elastic_example.
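A quick way to check and fix this, assuming you registered the linked template under the name apache_elastic_example (adjust the names to your setup):

```
# Inspect the installed template and its "template" pattern
GET _template/apache_elastic_example

# Re-register it with a wildcard pattern so it matches your actual indices
# (include the full settings/mappings from the original template in the body,
#  since PUT replaces the whole template)
PUT _template/apache_elastic_example
{
  "template": "apache-logs-*"
}
```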
The size of shards and segments matters. With only 1,000,000 documents I would recommend using a single shard in order to ensure it gets up to a decent size. When you have more data you would naturally have more than one shard, but you should try to keep the average shard size in the GB range.
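To see how large your shards actually are, the _cat API is handy (the index pattern is a placeholder):

```
GET _cat/shards/apache-logs-*?v&h=index,shard,prirep,docs,store
```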
Thanks @Christian_Dahlqvist. The template was not being applied to the index, and because of that the compression and mapping optimizations were not taking effect. I also tried with the number of shards set to 1, and it significantly reduced the size of the index.