ElasticSearch index size peculiarity


(Avleen Vig) #1

A couple of us have been comparing notes on our Logstash installations at
the larger end of the scale, and something about ElasticSearch has us
baffled.
We're hoping someone here can shed some light on this.

Currently I have 228,262,883 documents in an index.
It's taking up 243.1GB of space on disk.
The average size of the messages going into Logstash (which are then converted
to JSON and put in ES) was only ~500 bytes each.

At 500 bytes, that's about 106GB of raw logs.
I'm adding many fields to the JSON which gets dropped into ES, but still,
I would expect that with compression the space used would go down, not
up.
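
To make the arithmetic above explicit, here is a rough sketch of the numbers. It assumes the sizes quoted are binary gigabytes (2**30 bytes), which is the reading that reproduces the ~106 figure; the function names are made up for illustration.

```python
# Figures quoted in the post above.
DOC_COUNT = 228_262_883
INDEX_SIZE_GIB = 243.1   # assumption: the on-disk size is in units of 2**30 bytes
AVG_RAW_BYTES = 500      # approximate average size of a raw log line

GIB = 2 ** 30


def raw_size_gib(doc_count, avg_bytes):
    """Estimated size of the raw logs, in GiB."""
    return doc_count * avg_bytes / GIB


def bytes_per_doc(index_size_gib, doc_count):
    """Average on-disk bytes used per indexed document."""
    return index_size_gib * GIB / doc_count


raw = raw_size_gib(DOC_COUNT, AVG_RAW_BYTES)
per_doc = bytes_per_doc(INDEX_SIZE_GIB, DOC_COUNT)
expansion = per_doc / AVG_RAW_BYTES

print(f"raw logs:       {raw:.1f} GiB")
print(f"index per doc:  {per_doc:.0f} bytes")
print(f"expansion:      {expansion:.2f}x the raw size")
```

So the index is a bit more than twice the size of the raw input, at a little over 1KB per document.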

This is my mapping: https://gist.github.com/avleen/7440270
The only field being analyzed is "message".

And we just stopped sending the "message" field to Elasticsearch.
The docs/index size ratio changed very little, if at all.
We're still seeing ~1-1.5KB of disk space used per document in Elasticsearch.
It seems odd that such a small source should take up so much more space
once stored, even with LZ4 compression.

We noticed that while store-level compression might be helping some, it
doesn't seem to be helping as much as it could. Running gzip on the data
(raw and the index files) seems to provide quite a bit more compression
than we're getting right now.
Likewise, enabling compression on ZFS reduced the space taken by almost
half.
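
For anyone who wants to repeat the gzip comparison on their own data, this is roughly how the ratio can be measured. It is only a sketch: the sample log lines are invented, and real logs will compress differently.

```python
import gzip


def gzip_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed_size / original_size for gzip at the given level."""
    if not data:
        raise ValueError("need non-empty input")
    return len(gzip.compress(data, compresslevel=level)) / len(data)


# Invented sample: repetitive access-log style lines, for illustration only.
sample = b"".join(
    b"2013-11-12T19:34:%02d host%d GET /index.html 200\n" % (i % 60, i % 8)
    for i in range(2000)
)

ratio = gzip_ratio(sample)
print(f"gzip keeps {ratio:.1%} of the original size")
```

Feeding it a chunk of the raw logs and a chunk of the index files gives two ratios that can be compared against what the store-level compression achieves.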

Overall, I'm trying to index several billion log lines per day, and
multi-TB indexes add up in cost.
Does anyone have any suggestions on what we could do?

(big thanks to Jordan who has already gone way out of his way to help with
this!)
Thanks :)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Radu Gheorghe) #2

Hi Avleen,

I only have two ideas, but who knows :)

The first one is that, looking at your mapping, maybe Logstash really makes
your documents a lot bigger than the originals, even if most of the fields
are not_analyzed. To verify this, you can do an experiment:

This should get you the size of the JSON documents you're indexing. You may
even want to run a statistical facet to see the average size of the logs
you indexed:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html

If you get 3k per document, I'm not sure you can do much.
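
A quick way to sanity-check this outside of Elasticsearch is to serialize a sample of events and look at the byte sizes. This is just a sketch and not the experiment mentioned above: the two events are invented, and their field names are only an example of what a Logstash event might carry.

```python
import json
from statistics import mean


def json_size_stats(docs):
    """Return count/min/max/mean of the UTF-8 JSON size of each document."""
    sizes = [
        len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))
        for doc in docs
    ]
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
    }


# Invented events, for illustration only.
events = [
    {"@timestamp": "2013-11-12T19:34:00.000Z", "@version": "1",
     "host": "web01", "type": "apache", "message": "GET /index.html 200"},
    {"@timestamp": "2013-11-12T19:34:01.000Z", "@version": "1",
     "host": "web02", "type": "apache", "message": "GET /missing 404",
     "tags": ["error"]},
]

stats = json_size_stats(events)
print(stats)
```

Comparing the mean against the length of the original log line shows how much the added fields inflate each document before it is indexed.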

The second idea may be a bit in Captain Obvious territory, but it's worth
checking. Which ES version are you on? With 0.90 or later, you should have
good compression by default, at the Lucene level. So, if you're not already
there, it might be worth upgrading to the latest 0.90.7, if you're using
Logstash's elasticsearch_http output. If you're using the "standard"
elasticsearch output plugin, it might be worth upgrading to the latest
Logstash (1.2.2, runs on 0.90.3). Latest Logstash should be faster, too. I
know because I'm using it:

Best regards,
Radu

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Nov 12, 2013 at 7:34 PM, Avleen Vig avleen@gmail.com wrote:


