We are dumping about 100M ~3KB documents a day into Elasticsearch 1.4.1 and
indexing all fields (of which there are a few dozen). From a read
perspective, we perform dynamic queries that may return many results (all
of which may be relevant; we don't use scoring), so we want to keep the
_source field.
Obviously, this is taking a toll on our disk usage and we'd like to reduce
that. Questions:
1. Is it possible for me to index JSON but set the _source field myself?
   I would shove a protobuf or something similar in on insert, and on query
   I would convert it back to JSON.
2. I understand that _source is compressed, but I assume every document
   is compressed separately (our small documents don't benefit from that). Is
   there a way to somehow compress "across" documents to take advantage of the
   fact that our documents are extremely similar to one another? (A toy
   illustration of what I mean follows this list.)
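To make (2) concrete, here is a toy Python stand-in for what I mean. The
documents are synthetic (not our real schema) and the absolute numbers will
differ, but the gap is the point:

    import json
    import zlib

    # 1000 small, highly similar documents, like ours
    docs = [
        json.dumps({"user_id": i, "status": "active", "region": "us-east-1",
                    "payload": "x" * 100}).encode("utf-8")
        for i in range(1000)
    ]

    separately = sum(len(zlib.compress(d)) for d in docs)  # each doc on its own
    together = len(zlib.compress(b"".join(docs)))          # one stream across docs

    print("compressed separately:", separately)
    print("compressed together:  ", together)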
On Monday, December 15, 2014 5:44:38 PM UTC+2, Robert Muir wrote:
That is not the case, blocks of documents are compressed together:
Thanks, Robert.
I unscientifically swam around the code, pivoting around this, and saw that:
- This isn't tweakable: I can't choose to compress in larger chunks.
- 2.0.0 will have an option to use deflate for better compression (sketched
  below).
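For reference, from my reading of the code, that 2.0.0 option looks like it
will be exposed as an index setting. A sketch of what I mean in Python (the
setting name is my guess from the code, not a released API):

    import requests

    # Create an index that trades some retrieval speed for better compression.
    requests.put(
        "http://localhost:9200/docs",
        json={"settings": {"index.codec": "best_compression"}},
    )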
So if I can't tweak _source compression, can I shove in a _source of my own,
as posted originally in (1)?
It's not really tweakable at all before Lucene 5; that's why we added a
higher compression option. Note this option is not just deflate but
also uses a larger block size and other internal parameters.

Using a larger block size (64KB) for deflate is really a simple
workaround to get the feature out sooner rather than later, with the idea
that people who choose BEST_COMPRESSION are willing to sacrifice some
retrieval speed.

Increasing the block size hurts retrieval performance and
is not really the best way overall to get better compression when
there is high redundancy across documents. In the future I hope we can
add preset-dictionary support for sharing across blocks.
So the current block size should really be seen as an internal thing.
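zlib already exposes that preset-dictionary idea, so a toy Python sketch can
show what it buys for small, similar documents (the dictionary here is just a
contrived sample document, not a trained one):

    import json
    import zlib

    # Bytes that recur across documents; in practice you would build this
    # from a sample of real docs.
    shared = json.dumps({"user_id": 0, "status": "active",
                         "region": "us-east-1"}).encode("utf-8")

    doc = json.dumps({"user_id": 42, "status": "active",
                      "region": "us-east-1"}).encode("utf-8")

    c = zlib.compressobj(zdict=shared)
    with_dict = c.compress(doc) + c.flush()

    print("plain deflate:   ", len(zlib.compress(doc)))
    print("with preset dict:", len(with_dict))

    # Decompression must be seeded with the same dictionary.
    d = zlib.decompressobj(zdict=shared)
    assert d.decompress(with_dict) == doc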
Seems to me that an uncompressed binary representation a la protobuf would be smaller than compressed JSON, given our schema.
If that were to prove correct, is it possible to do this? I don't expect Elasticsearch to do anything except allow me to control the contents of _source.
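As a back-of-envelope check on that claim, here is a rough sketch using
struct as a stand-in for protobuf (the record and its field types are made
up, not our schema):

    import json
    import struct
    import zlib

    record = {"user_id": 123456, "score": 0.87, "active": True}

    as_json = json.dumps(record).encode("utf-8")
    # <Qd? = unsigned 64-bit int, double, bool: 17 bytes, no field names at all
    as_binary = struct.pack("<Qd?", record["user_id"], record["score"],
                            record["active"])

    print("json:          ", len(as_json))
    print("json + deflate:", len(zlib.compress(as_json)))
    print("packed binary: ", len(as_binary))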
The question is how much. For example, in my tests it saves 2% with LZ4
and 0.5% with deflate (the larger block size used there makes it
super-not-worth-it).
Given those stats I totally agree, but this would have to vary across different schemas... That's why I'd like to at least experiment with it. Is this even possible through the public HTTP interface?
If you want to do such experiments, it will be hard to do with ES,
since you would have to plumb a ton of code to even get the results.
Instead, I write Lucene code to test these things out. This also makes
the benchmark fast, since I don't "index" anything, so there is no real
flushing or merging going on to make benchmarking more difficult (when
I want to measure the performance of things like merge, I force it to
happen at predictable intervals to keep index size comparisons valid).
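In the same spirit, here is a standalone sketch (no ES, no Lucene) of how
block size affects single-document retrieval: to read one stored doc you pay
for decompressing its whole block:

    import json
    import time
    import zlib

    # One ~3KB document, similar to ours
    doc = json.dumps({"user_id": 1, "status": "active", "region": "us-east-1",
                      "payload": "x" * 2900}).encode("utf-8")

    def one_fetch_cost(block_size):
        block = doc * (block_size // len(doc) + 1)  # fill a block with docs
        compressed = zlib.compress(block)
        start = time.perf_counter()
        for _ in range(1000):
            zlib.decompress(compressed)             # cost of fetching any one doc
        return (time.perf_counter() - start) / 1000

    print("16KB block:", one_fetch_cost(16 * 1024))
    print("64KB block:", one_fetch_cost(64 * 1024))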
Thanks for the pointers. I just realized I can disable _source and "store"
a field with the encoded data (D'oh). If I find anything semi-intelligent
during my tests, I'll report back.
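For anyone following along, a sketch of that idea against the HTTP interface
(index, type, and field names are made up; ES wants binary field values
base64-encoded):

    import base64
    import requests

    # Mapping: no _source, but a stored binary field carrying the encoded doc.
    requests.put("http://localhost:9200/docs", json={
        "mappings": {
            "doc": {
                "_source": {"enabled": False},
                "properties": {
                    "blob": {"type": "binary", "store": True},
                    # ...plus the regular fields we index for querying
                },
            }
        }
    })

    # Insert: ship the protobuf bytes as base64 alongside the indexed fields.
    encoded = base64.b64encode(b"<protobuf bytes>").decode("ascii")
    requests.put("http://localhost:9200/docs/doc/1", json={"blob": encoded})

    # Query: ask for the stored field instead of _source.
    hit = requests.get("http://localhost:9200/docs/doc/1",
                       params={"fields": "blob"}).json()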
After a bit of testing, I found that _source takes up 37% of our total
disk space. Given everything said here (cross-document compression, future
support for better compression), we decided to leave things as they are.
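In case it helps anyone reproduce a number like that 37%, one way is to index
the same sample into two indices, one with _source disabled, and compare
on-disk store sizes via the stats API (index names are made up):

    import requests

    def store_bytes(index):
        stats = requests.get(
            "http://localhost:9200/%s/_stats/store" % index).json()
        return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]

    with_source = store_bytes("docs_with_source")
    without_source = store_bytes("docs_without_source")
    print("_source share: %.0f%%"
          % (100.0 * (with_source - without_source) / with_source))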