Aggressive index compression

Hi there,

We want to index lots of data; however, after a certain period of time it
doesn't need to be as "hot" as data from the past week. What do you experts
think of the following approach?

  1. Open index
  2. Write lots of data into it

The index becomes less important after 7 days:

  1. Close index
  2. (g)zip the index

The index remains gzipped until it is needed again:

  1. Receive a search request for the index
  2. Un(g)zip the index
  3. Open the index
  4. Perform the search

Is there anything I'm missing here that would cause problems? A quick test
on a single node worked perfectly. The default compression doesn't help us
enough.
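
For the record, here's a minimal sketch of how this archive/restore cycle
could be automated (assuming a single-node setup and Python; the endpoint,
index name, and data path are hypothetical and must match your installation):

    import json
    import tarfile
    import urllib.request
    from pathlib import Path

    ES = "http://localhost:9200"      # assumed local node
    INDEX = "logs-2012.09.11"         # hypothetical weekly index
    # Assumed default data layout; adjust to your elasticsearch.yml.
    DATA_DIR = Path("/var/lib/elasticsearch/data/nodes/0/indices")

    def es_post(path):
        """POST to the Elasticsearch REST API and return the parsed reply."""
        req = urllib.request.Request(f"{ES}/{path}", data=b"", method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def archive(index):
        """Close the index, then gzip its directory on disk."""
        es_post(f"{index}/_close")
        src = DATA_DIR / index
        with tarfile.open(f"{src}.tar.gz", "w:gz") as tar:
            tar.add(src, arcname=index)
        # After verifying the archive, the original directory can be removed.

    def restore(index):
        """Unpack the gzipped index and reopen it for searching."""
        with tarfile.open(DATA_DIR / f"{index}.tar.gz", "r:gz") as tar:
            tar.extractall(DATA_DIR)
        es_post(f"{index}/_open")

    archive(INDEX)   # after day 7
    restore(INDEX)   # when a search request comes in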

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


--

The relatively new compression option should do the trick without the need to gzip your index (you won't gain that much). More details are in the Elasticsearch documentation.
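
For illustration, a minimal sketch of enabling it when creating an index
(this assumes the store-level compression settings for stored fields and
term vectors introduced around 0.19.5; the index name is hypothetical):

    import json
    import urllib.request

    ES = "http://localhost:9200"

    # Enable compression of stored fields and term vectors at creation
    # time (assumed setting names: index.store.compress.stored / .tv).
    settings = {
        "settings": {
            "index.store.compress.stored": True,
            "index.store.compress.tv": True,
        }
    }

    req = urllib.request.Request(
        f"{ES}/logs-2012.09.18",          # hypothetical index name
        data=json.dumps(settings).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode())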

On Sep 18, 2012, at 3:09 PM, Robin Verlangen robin@us2.nl wrote:

--

Thank you for the reference; I was already aware of those options, though.
A quick benchmark indicated we could still gain a lot. I'll publish the
details here soon!

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/18 Shay Banon kimchy@gmail.com


--

I published my findings on Elasticsearch compression on my personal blog
with some in-depth information about the benchmark:
http://www.robinverlangen.nl/index/view/50597e5876ad1-6e7e5f/elasticsearch-compression-benchmark.html

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/19 Robin Verlangen robin@us2.nl


--

Hi Robin

On Wed, 2012-09-19 at 10:13 +0200, Robin Verlangen wrote:

I published my findings on Elasticsearch compression on my personal
blog with some in-depth information about the benchmark:
http://www.robinverlangen.nl/index/view/50597e5876ad1-6e7e5f/elasticsearch-compression-benchmark.html

Are your numbers correct? You say that an index without compression was
3,282 MB but an index with compression was 15,400 MB ??? i.e. almost 5
times BIGGER?

clint

--

That was a stupid typo: it is of course 1,540 MB, i.e. roughly 53% smaller
than the uncompressed 3,282 MB. I updated the attached PDF file:
http://www.robinverlangen.nl/assets/blog/ES-benchmark---Compress-indexes.pdf

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/19 Clinton Gormley clint@traveljury.com


--

Thanks Robin -- seeing some real numbers is always refreshing! IMO the
Elasticsearch documentation sorely lacks "ballpark figures" for what to
expect under common scenarios. Together with some general info on how
things are expected to scale (constant/linear/sublinear...), that would
already help newcomers a lot.

Btw, any chance you might also compare query times in your setup? Would
the no-compression/compressed/compressed-term-vectors variants affect that
at all? (Ignoring the offline ZIP option, of course.)
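
Even something as simple as timing the same query against each variant
would tell us a lot -- a rough sketch (the index names are hypothetical):

    import json
    import time
    import urllib.request

    ES = "http://localhost:9200"
    QUERY = json.dumps({"query": {"term": {"message": "error"}}}).encode()

    def mean_latency_ms(index, runs=100):
        """Run the same query repeatedly; return the mean latency in ms."""
        start = time.time()
        for _ in range(runs):
            req = urllib.request.Request(
                f"{ES}/{index}/_search",
                data=QUERY,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req).read()
        return (time.time() - start) / runs * 1000

    for index in ("logs-plain", "logs-compressed", "logs-compressed-tv"):
        print(index, f"{mean_latency_ms(index):.1f} ms")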

Best,
Radim

On Sep 19, 10:36 am, Robin Verlangen ro...@us2.nl wrote:


--

Hi Radim,

We'll get into that later. Our application (CloudPelican) is going to
gather lots and lots of data from all kinds of different sources. We
already picked Elasticsearch over Solr, Solandra, raw Cassandra and
Lucene for our indexing process. The first important thing for us was to
determine how much storage overhead was involved. Query times are
relevant, but probably tunable with lots of parameters.

Once we know more I'll post an update here, or you can just follow my blog.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


2012/9/19 Radim me@radimrehurek.com


--