Aggressive index compression

Hi there,

We want to index lots of data; however, after a certain period of time it
doesn't need to be as "hot" as data from the past week. What do you experts
think of the following approach?

  1. Open index
  2. Write lots of data into it

The index becomes less important after 7 days:

  1. Close index
  2. (g)zip the index

The index remains gzipped until it is needed again:

  1. Receive a search request for the index
  2. Un(g)zip the index
  3. Open the index
  4. Perform the search

Is there anything I'm missing here that would cause problems? A quick test
on a single node worked perfectly. The default compression doesn't help us
enough.
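
For the record, here's a minimal sketch of how this archive/restore cycle
could be automated (assuming a single-node setup and Python; the endpoint,
index name, and data path are hypothetical and must match your installation):

    import json
    import tarfile
    import urllib.request
    from pathlib import Path

    ES = "http://localhost:9200"      # assumed local node
    INDEX = "logs-2012.09.11"         # hypothetical weekly index
    # Assumed default data layout; adjust to your elasticsearch.yml.
    DATA_DIR = Path("/var/lib/elasticsearch/data/nodes/0/indices")

    def es_post(path):
        """POST to the Elasticsearch REST API and return the parsed reply."""
        req = urllib.request.Request(f"{ES}/{path}", data=b"", method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def archive(index):
        """Close the index, then gzip its directory on disk."""
        es_post(f"{index}/_close")
        src = DATA_DIR / index
        with tarfile.open(f"{src}.tar.gz", "w:gz") as tar:
            tar.add(src, arcname=index)
        # After verifying the archive, the original directory can be removed.

    def restore(index):
        """Unpack the gzipped index and reopen it for searching."""
        with tarfile.open(DATA_DIR / f"{index}.tar.gz", "r:gz") as tar:
            tar.extractall(DATA_DIR)
        es_post(f"{index}/_open")

    archive(INDEX)   # after day 7
    restore(INDEX)   # when a search request comes in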

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


--

The relatively new compression option should do the trick without the need to gzip your index (you won't gain that much). More details are in the Elasticsearch documentation.
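
For illustration, a minimal sketch of enabling it when creating an index
(this assumes the store-level compression settings for stored fields and
term vectors introduced around 0.19.5; the index name is hypothetical):

    import json
    import urllib.request

    ES = "http://localhost:9200"

    # Enable compression of stored fields and term vectors at creation
    # time (assumed setting names: index.store.compress.stored / .tv).
    settings = {
        "settings": {
            "index.store.compress.stored": True,
            "index.store.compress.tv": True,
        }
    }

    req = urllib.request.Request(
        f"{ES}/logs-2012.09.18",          # hypothetical index name
        data=json.dumps(settings).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode())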

On Sep 18, 2012, at 3:09 PM, Robin Verlangen robin@us2.nl wrote:

--

Thank you for the reference; I was already aware of those options, though.
A quick benchmark indicated we could still gain a lot. I'll publish the
details here soon!

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/18 Shay Banon kimchy@gmail.com


--

I published my findings on Elasticsearch compression on my personal blog
with some in-depth information about the benchmark:
http://www.robinverlangen.nl/index/view/50597e5876ad1-6e7e5f/elasticsearch-compression-benchmark.html

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/19 Robin Verlangen robin@us2.nl


--

Hi Robin

On Wed, 2012-09-19 at 10:13 +0200, Robin Verlangen wrote:

I published my findings on Elasticsearch compression on my personal
blog with some in-depth information about the benchmark:
http://www.robinverlangen.nl/index/view/50597e5876ad1-6e7e5f/elasticsearch-compression-benchmark.html

Are your numbers correct? You say that an index without compression was
3,282 MB but an index with compression was 15,400 MB ??? i.e. almost 5
times BIGGER?

clint

--

That was a stupid typo: it is of course 1,540 MB, i.e. roughly 53% smaller
than the uncompressed 3,282 MB. I updated the attached PDF file:
http://www.robinverlangen.nl/assets/blog/ES-benchmark---Compress-indexes.pdf

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/19 Clinton Gormley clint@traveljury.com


--

Thanks Robin -- seeing some real numbers is always refreshing! IMO the
Elasticsearch documentation sorely lacks "ballpark figures" for what to
expect under common scenarios. Together with some general info on how
things are expected to scale (constant/linear/sublinear...), that would
already help newcomers a lot.

Btw, any chance you might also compare query times in your setup? Would
the no-compression/compressed/compressed-term-vectors variants affect that
at all? (Ignoring the offline ZIP option, of course.)
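
Even something as simple as timing the same query against each variant
would tell us a lot -- a rough sketch (the index names are hypothetical):

    import json
    import time
    import urllib.request

    ES = "http://localhost:9200"
    QUERY = json.dumps({"query": {"term": {"message": "error"}}}).encode()

    def mean_latency_ms(index, runs=100):
        """Run the same query repeatedly; return the mean latency in ms."""
        start = time.time()
        for _ in range(runs):
            req = urllib.request.Request(
                f"{ES}/{index}/_search",
                data=QUERY,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req).read()
        return (time.time() - start) / runs * 1000

    for index in ("logs-plain", "logs-compressed", "logs-compressed-tv"):
        print(index, f"{mean_latency_ms(index):.1f} ms")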

Best,
Radim

On Sep 19, 10:36 am, Robin Verlangen ro...@us2.nl wrote:


--

Hi Radim,

We'll get into that later. Our application (CloudPelican) is going to
gather lots and lots of data from all kinds of different sources. We
already picked Elasticsearch over Solr, Solandra, raw Cassandra and
Lucene for our indexing process. The first important thing for us was to
determine how much storage overhead was involved. Query times are
relevant, but probably tunable with lots of parameters.

Once we know more I'll post an update here, or you can just follow my blog.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/19 Radim me@radimrehurek.com


--