Understanding source compression

Hello, everyone.

I'd like to make sure I understand _source compression correctly.

As I understand it, compression has no impact on the search itself: the
inverted indexes are not compressed, so no decompression is needed there.
When results are returned, however, each hit has to be decompressed (that
is, if I ask for a size of 10, there will be only 10 decompression
operations, even if several million docs match the query).
For search operations, this should be the only drawback performance-wise,
and, as I understand it, quite a minimal one.
Is that correct?

My indexes are composed of several million rather big docs (500+ fields,
15+ nested collections). For each indexing operation, the source will have
to be compressed, at some performance cost.
The bigger drawback of compression would therefore be indexing performance.
Is that correct?
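For context, here is a minimal sketch of the kind of mapping change involved, assuming the pre-1.0 "_source" mapping options "compress" and "compress_threshold"; the index and type names, the threshold value and the use of plain HttpURLConnection are placeholders for the example, not my actual setup:

    // a hedged sketch: turning on _source compression via the mapping API,
    // assuming the pre-1.0 Elasticsearch "_source" options "compress" and
    // "compress_threshold"; index/type names and the URL are placeholders
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class EnableSourceCompression {
        public static void main(String[] args) throws Exception {
            String mapping =
                "{ \"doc\": { \"_source\": { \"compress\": true, \"compress_threshold\": \"1kb\" } } }";
            HttpURLConnection con = (HttpURLConnection)
                new URL("http://localhost:9200/bigdocs/doc/_mapping").openConnection();
            con.setRequestMethod("PUT");
            con.setDoOutput(true);
            OutputStream out = con.getOutputStream();
            out.write(mapping.getBytes("UTF-8"));
            out.close();
            System.out.println("HTTP " + con.getResponseCode());
        }
    }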

Thanks for any clarifications.
regards
Deny

--

Have a look at this excellent post from Adrien Grand: http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

It won't answer your questions directly, but it will give you some hints about performance when using compression.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


--

Interesting indeed, thanks, David.

As for the compression, I ran some tests. Performance-wise, I can hardly
see a difference between a compressed and an uncompressed index, so I guess
the answers to my questions are "yes".
Store or source compression is therefore, for big indexes and as far as I
can tell, a must: you gain disk space and do not suffer performance
penalties.


--

You should be aware that Lucene 4.1 introduces compression for stored
fields, so you might want to test with that using the latest Elasticsearch
snapshots (I believe it is on by default there).

Generally, Lucene 4.1 is going to be nice for the type of scenario you
outline. Compression and more efficient storage codecs are going to make I/O
much less of a factor when querying. Also, the hit for compression should
only happen on the records you actually return, which avoids a lot of
decompression operations relative to 3.6.x.
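To make that concrete, here is a rough sketch with the Lucene 4.1 API (the path, field names and document content are placeholders): with the default codec, stored fields are compressed transparently when a document is added, while analyzed fields go into the inverted index as usual.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class CompressedStoredFieldsSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/idx"));  // placeholder path
            IndexWriterConfig cfg =
                new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
            IndexWriter writer = new IndexWriter(dir, cfg);

            Document doc = new Document();
            // stored fields (like Elasticsearch's _source) are compressed block by
            // block with the default Lucene 4.1 codec; the cost is paid once here
            doc.add(new StoredField("_source", "{\"field\":\"value\"}"));
            // analyzed fields feed the inverted index and are not affected by
            // stored-field compression
            doc.add(new TextField("body", "some analyzed text", Field.Store.NO));

            writer.addDocument(doc);
            writer.close();
        }
    }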

The impact of compression is, in my view, generally worth it. People
typically overestimate the amount of CPU it takes to compress and decompress
and underestimate the effect of cutting out a large percentage of disk and
network I/O. You have to benchmark, of course, but my experience with Lucene
and Solr is that things are fine as long as indices and other data structures
fit in memory. Especially on large indices, limiting disk I/O to the bare
minimum can make a lot of difference. I/O tends to be the limiting factor on
index size, not CPU, so less I/O is a good thing.

Jilles

--

Hi all,

On Wed, Jan 23, 2013 at 10:48 AM, DH ciddp195@gmail.com wrote:

As I understand it, compression has no impact on the search itself: the
inverted indexes are not compressed, so no decompression is needed there.

It's true that Lucene doesn't use a general-purpose compression
algorithm to compress the inverted index, but it tries to use a very
compact representation (based on delta-encoding and bit-packing[1]
since Lucene 4.1 and variable-length encoding[2][3] in older versions)
so it can be likened to compression. The good news is that the
inverted index in Lucene 4.1 is faster (see annotation AB on [4]) and
usually smaller.

[1] Lucene41PostingsFormat (Lucene 4.1.0 API)
[2] DataOutput (Lucene 4.1.0 API)
[3] Lucene40PostingsFormat (Lucene 4.0.0 API)
[4] Lucene TermQuery queries/sec
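
To give a toy illustration of the idea behind delta-encoding and bit-packing (this is not Lucene's actual code, just the principle):

    public class DeltaEncodingToy {
        public static void main(String[] args) {
            // a sorted postings list of doc IDs for one term
            int[] docIds = {100, 103, 110, 112, 250};

            // delta-encode: store the gap to the previous doc ID instead of the ID
            int[] deltas = new int[docIds.length];
            int prev = 0;
            for (int i = 0; i < docIds.length; i++) {
                deltas[i] = docIds[i] - prev;
                prev = docIds[i];
            }

            // deltas = {100, 3, 7, 2, 138}: every value fits in 8 bits, so a block
            // of them can be bit-packed at 8 bits per entry instead of 32, without
            // any general-purpose compressor
            int maxBits = 0;
            for (int d : deltas) {
                maxBits = Math.max(maxBits, 32 - Integer.numberOfLeadingZeros(d));
            }
            System.out.println("bits per value after delta-encoding: " + maxBits);
        }
    }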

When results are returned, however, each hit has to be decompressed (that
is, if I ask for a size of 10, there will be only 10 decompression
operations, even if several million docs match the query).
For search operations, this should be the only drawback performance-wise,
and, as I understand it, quite a minimal one.
Is that correct?

This is correct.
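
Here is a minimal sketch of why, using the raw Lucene 4.1 API (the directory path, field names and query term are placeholders): stored fields are only read, and therefore only decompressed, for the hits you actually fetch.

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class FetchOnlyTopHits {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher(
                DirectoryReader.open(FSDirectory.open(new File("/tmp/idx"))));

            // the query may match millions of documents; scoring them only reads
            // the inverted index, which is not compressed with a general-purpose
            // algorithm
            TopDocs top = searcher.search(new TermQuery(new Term("body", "compression")), 10);

            // only these "size" calls read stored fields, so only these 10
            // documents are decompressed
            for (ScoreDoc sd : top.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("_source"));
            }
        }
    }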

My indexes are composed of several million rather big docs (500+ fields,
15+ nested collections). For each indexing operation, the source will have
to be compressed, at some performance cost.
The bigger drawback of compression would therefore be indexing performance.
Is that correct?

If the compression algorithm is lightweight (as is the case for both LZF,
used by Elasticsearch, and LZ4, used by Lucene 4.1), it won't necessarily
be the indexing bottleneck, especially if your analysis chain is costly.
Moreover, given that it reduces the amount of I/O to perform, it could even
make indexing faster on slow disks.
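
As a rough illustration of how lightweight this is at the byte level, here is a sketch assuming the lz4-java library's LZ4Factory/LZ4Compressor/LZ4FastDecompressor API; it is an illustration only, not Elasticsearch's or Lucene's internal code:

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;
    import net.jpountz.lz4.LZ4FastDecompressor;

    public class Lz4RoundTrip {
        public static void main(String[] args) throws Exception {
            byte[] source =
                "{\"field\": \"a fairly repetitive JSON document ...\"}".getBytes("UTF-8");

            // compress once at index time
            LZ4Factory factory = LZ4Factory.fastestInstance();
            LZ4Compressor compressor = factory.fastCompressor();
            byte[] compressed = compressor.compress(source);

            // decompress only for the hits that are actually returned
            LZ4FastDecompressor decompressor = factory.fastDecompressor();
            byte[] restored = decompressor.decompress(compressed, source.length);

            System.out.println(source.length + " bytes -> "
                + compressed.length + " bytes on disk/network");
        }
    }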

On Thu, Jan 31, 2013 at 11:49 AM, Jilles van Gurp
jillesvangurp@gmail.com wrote:

The impact of compression is, in my view, generally worth it. People
typically overestimate the amount of CPU it takes to compress and decompress
and underestimate the effect of cutting out a large percentage of disk and
network I/O. You have to benchmark, of course, but my experience with Lucene
and Solr is that things are fine as long as indices and other data structures
fit in memory. Especially on large indices, limiting disk I/O to the bare
minimum can make a lot of difference. I/O tends to be the limiting factor on
index size, not CPU, so less I/O is a good thing.

I couldn't agree more; this is precisely what motivated me to make stored
fields compressed by default in Lucene 4.1!

--
Adrien
