Custom _source compression / compaction to reduce disk usage

Hi list,

We are indexing about 100M documents a day (~3 KB each) into Elasticsearch
1.4.1, with all fields indexed (there are a few dozen). On the read side we
run dynamic queries that may return many results (all of which may be
relevant; we don't use scoring), so we want to keep the _source field.

Obviously, this is taking a toll on our disk usage and we'd like to reduce
that. Questions:

  1. Is it possible for me to index JSON but set the _source field myself?
    I would shove a protobuf or something similar on insert and on query I
    would revert it back to JSON
  2. I understand that _source is compressed, but I assume every document
    is compressed separately (our small documents don't benefit from that). Is
    there a way to somehow compress "across" documents to take advantage of the
    fact that our documents are extremely similar to one another?
  3. Any other ideas?

Thanks,
Eran

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f0588de1-b0e4-468f-ad25-b74f4abb444d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

On Mon, Dec 15, 2014 at 9:20 AM, Eran Duchan pavius@gmail.com wrote:

I understand that _source is compressed, but I assume every document is
compressed separately (our small documents don't benefit from that).

That is not the case; blocks of documents are compressed together:

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat.html
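
To make the difference concrete, here is a rough sketch that compresses a set of similar small documents one at a time and then in blocks. It is only an illustration: zlib stands in for the LZ4 codec that Lucene actually uses, the 16 KB block size mirrors the chunk size described in the linked Javadoc, and the document fields are made up.

```python
import json
import zlib


def make_docs(n):
    """Build n similar JSON documents (hypothetical fields, illustration only)."""
    return [
        json.dumps({
            "timestamp": 1418650000 + i,
            "host": "web-%02d.example.com" % (i % 20),
            "status": 200 if i % 7 else 500,
            "path": "/api/v1/items/%d" % (i % 50),
            "bytes_sent": 1000 + (i * 37) % 5000,
        }).encode("utf-8")
        for i in range(n)
    ]


def per_document_size(docs):
    """Total size when every document is compressed on its own."""
    return sum(len(zlib.compress(d)) for d in docs)


def blocked_size(docs, block_bytes=16 * 1024):
    """Total size when documents are concatenated into blocks before compressing."""
    total = 0
    block = bytearray()
    for d in docs:
        block += d
        if len(block) >= block_bytes:
            total += len(zlib.compress(bytes(block)))
            block = bytearray()
    if block:  # compress the last, partially filled block
        total += len(zlib.compress(bytes(block)))
    return total


docs = make_docs(2000)
raw = sum(len(d) for d in docs)
separate = per_document_size(docs)
blocked = blocked_size(docs)
print("raw bytes:              ", raw)
print("compressed per document:", separate)
print("compressed in blocks:   ", blocked)
```

The absolute numbers depend entirely on the data and the codec; the point is only that block compression can exploit redundancy between neighbouring documents, which per-document compression cannot.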


On Monday, December 15, 2014 5:44:38 PM UTC+2, Robert Muir wrote:

That is not the case, blocks of documents are compressed together:

Thanks, Robert.

I unscientifically swam around the code pivoting around this and saw that:

  1. This isn't tweakable - I can't choose to compress in larger chunks
  2. 2.0.0 will have an option to use deflate for better compression
    (https://github.com/elasticsearch/elasticsearch/pull/8863)

So if I can't tweak _source compression, can I shove a _source of my own as
posted originally in (1)?

Eran


On Mon, Dec 15, 2014 at 11:49 AM, Eran Duchan pavius@gmail.com wrote:

On Monday, December 15, 2014 5:44:38 PM UTC+2, Robert Muir wrote:

That is not the case, blocks of documents are compressed together:

Thanks, Robert.

I unscientifically swam around the code pivoting around this and saw that:

  1. This isn't tweakable - I can't choose to compress in larger chunks
  2. 2.0.0 will have an option to use deflate for better compression

So if I can't tweak _source compression, can I shove a _source of my own as
posted originally in (1)?

It's not really tweakable at all before Lucene 5; that's why we added a
higher compression option. Note that this option is not just deflate: it
also uses a larger block size and other internal parameters.

Using a larger block size (64 KB) for deflate is really a simple
workaround to get the feature out sooner rather than later, with the idea
that people who choose BEST_COMPRESSION are willing to sacrifice some
retrieval speed.

Increasing the block size hurts retrieval performance and
is not really the best way overall to get better compression when
there is high redundancy across documents. In the future I hope we can
add preset dictionary support for sharing across blocks.

So the current block size should really be seen as an internal thing.
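
For readers unfamiliar with preset dictionaries, the idea can be sketched with Python's zlib, which accepts a `zdict` argument. This is not how Lucene implements or plans to implement it; it only shows the general technique at document granularity, with made-up document fields and a dictionary built from a handful of sample documents.

```python
import json
import zlib


def make_doc(i):
    """One small JSON document (hypothetical fields, illustration only)."""
    return json.dumps({
        "timestamp": 1418650000 + i,
        "host": "web-%02d.example.com" % (i % 20),
        "status": 200 if i % 7 else 500,
        "path": "/api/v1/items/%d" % (i % 50),
    }).encode("utf-8")


# The dictionary is just a sample of typical documents; deflate can look
# back at most 32 KB, so anything beyond that would be wasted.
zdict = b"".join(make_doc(i) for i in range(20))[-32768:]


def compress_with_dict(data, zdict):
    """Deflate data using a preset dictionary shared with the reader."""
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, zdict):
    """Inflate data; the same dictionary must be supplied again."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


doc = make_doc(12345)
plain = zlib.compress(doc)
with_dict = compress_with_dict(doc, zdict)
print("raw:", len(doc), "no dictionary:", len(plain), "with dictionary:", len(with_dict))
```

Each document can still be decompressed on its own, so retrieval does not pay for a larger block, but the dictionary has to be stored and kept available for as long as the data exists.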


Got it, thanks.
Any insight on a custom _source? Is this doable?


I don't understand what you hope to gain from it.

Using a binary encoding isn't really going to improve the compression
here... I have done tests with it.



It seems to me that an uncompressed binary representation à la protobuf would be smaller than compressed JSON, given our schema.

If that proves correct, is it possible to do this? I don't expect Elasticsearch to do anything except let me control the contents of _source.

Eran


On Mon, Dec 15, 2014 at 12:53 PM, Eran Duchan pavius@gmail.com wrote:

It seems to me that an uncompressed binary representation à la protobuf would be smaller than compressed JSON, given our schema.

If that proves correct, is it possible to do this? I don't expect Elasticsearch to do anything except let me control the contents of _source.

The question is how much. For example, in my tests it saves 2% with LZ4
and 0.5% with deflate (the larger block size used there makes it
super-not-worth-it).

It's not worth the complexity IMO.
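
Anyone who wants to check this against their own schema can do a quick comparison outside Elasticsearch. The sketch below is one hypothetical way to do it: it encodes the same records as JSON and as a fixed-width binary layout (a stand-in for protobuf, using `struct`), then compresses both in blocks with zlib (a stand-in for the real codecs). The record fields and block size are assumptions, and it does not attempt to reproduce the figures quoted above.

```python
import json
import struct
import zlib

# Hypothetical fixed-width layout: timestamp, host id, status, bytes sent.
RECORD = struct.Struct("<IHHI")


def make_records(n):
    """Generate n similar records as tuples (illustration only)."""
    return [
        (1418650000 + i, i % 20, 200 if i % 7 else 500, 1000 + (i * 37) % 5000)
        for i in range(n)
    ]


def as_json(rec):
    ts, host, status, sent = rec
    return json.dumps(
        {"timestamp": ts, "host_id": host, "status": status, "bytes_sent": sent}
    ).encode("utf-8")


def as_binary(rec):
    return RECORD.pack(*rec)


def compressed_in_blocks(chunks, block_bytes=16 * 1024):
    """Concatenate encoded records into blocks and sum the compressed sizes."""
    total = 0
    block = bytearray()
    for chunk in chunks:
        block += chunk
        if len(block) >= block_bytes:
            total += len(zlib.compress(bytes(block)))
            block = bytearray()
    if block:
        total += len(zlib.compress(bytes(block)))
    return total


def report(records):
    """Return raw and block-compressed sizes for both encodings."""
    json_docs = [as_json(r) for r in records]
    bin_docs = [as_binary(r) for r in records]
    return {
        "json_raw": sum(len(d) for d in json_docs),
        "binary_raw": sum(len(d) for d in bin_docs),
        "json_compressed": compressed_in_blocks(json_docs),
        "binary_compressed": compressed_in_blocks(bin_docs),
    }


sizes = report(make_records(5000))
for name, size in sorted(sizes.items()):
    print("%-18s %d" % (name, size))
```

The raw binary form is much smaller than raw JSON, but the number that matters for disk usage is the difference between the two compressed figures, and that depends on the schema, the data, and the codec.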


Given those stats I totally agree, but the savings will vary from schema to schema... That's why I'd like to at least experiment with it. Is this even possible through the public HTTP interface?


If you want to do such experiments, it will be hard to do with ES,
since you would have to plumb through a ton of code to even get the results.

Instead I write Lucene code to test these things out. This also makes
the benchmark fast, since I don't "index" anything, so there is no real
flushing or merging going on to make benchmarking more difficult (when
I want to measure the performance of things like merging, I force it to
happen at predictable intervals to keep index size comparisons valid).



Thanks for the pointers. I just realized I can disable _source and "store"
a field with the encoded data (D'oh). If I find anything semi-intelligent
during my tests, I'll report back.
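
A rough sketch of what that could look like against Elasticsearch 1.x follows. It only builds the request bodies; the type name `event`, the field name `payload`, and the helper functions are made up for illustration, and the mapping should be checked against the documentation for the version in use.

```python
import base64
import json

# Hypothetical mapping: _source is disabled and the encoded document is kept
# in a stored binary field. Other fields are still indexed for querying.
mapping = {
    "mappings": {
        "event": {  # hypothetical type name
            "_source": {"enabled": False},
            "properties": {
                "payload": {"type": "binary", "store": True},
                "status": {"type": "integer"},
            },
        }
    }
}


def to_index_body(fields, encoded):
    """Build the JSON body to index: searchable fields plus the encoded blob.

    Binary fields are sent as base64 strings.
    """
    body = dict(fields)
    body["payload"] = base64.b64encode(encoded).decode("ascii")
    return body


def search_body(query):
    """Ask for the stored field explicitly, since there is no _source to return."""
    return {"query": query, "fields": ["payload"]}


def payload_from_hit(hit):
    """Extract and decode the blob from one search hit.

    Stored field values may come back wrapped in a list, so handle both shapes.
    """
    value = hit["fields"]["payload"]
    if isinstance(value, list):
        value = value[0]
    return base64.b64decode(value)


encoded = b"\x08\x96\x01"  # placeholder for a protobuf-encoded document
body = to_index_body({"status": 200}, encoded)
print(json.dumps(mapping, indent=2))
print(json.dumps(body))
```

On the read side the client decodes `payload` back into JSON itself. Anything that relies on _source inside Elasticsearch would stop working with this setup.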


I'm pretty sure you'll lose cross-document compression that way, which is
very noticeable with lots of 3 KB documents.

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/



After a bit of testing, we found that _source takes up 37% of the total
disk space. Given everything said here (cross-document compression, future
support for better compression), we decided to leave this as is.
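
The thread does not say how the 37% was measured. One possible way to get a comparable estimate is to sum the Lucene stored-fields files (`.fdt` and `.fdx`) in a shard's index directory and compare them with the total, as in the sketch below. Two caveats: those files hold every stored field, not only _source, and segments written as compound files (`.cfs`) hide the breakdown.

```python
import os
import sys
from collections import defaultdict

# Lucene stored-fields data and index files.
STORED_FIELD_EXTENSIONS = {".fdt", ".fdx"}


def sizes_by_extension(index_dir):
    """Sum file sizes under index_dir, grouped by file extension."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1]
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def stored_fields_share(index_dir):
    """Fraction of the directory's bytes taken by stored-fields files."""
    totals = sizes_by_extension(index_dir)
    total = sum(totals.values())
    stored = sum(
        size for ext, size in totals.items() if ext in STORED_FIELD_EXTENSIONS
    )
    return stored / total if total else 0.0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the path to a shard's Lucene index directory.
    print("stored fields share: %.1f%%" % (100 * stored_fields_share(sys.argv[1])))
```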
