I would like to know if Elasticsearch documents/indices are stored in compressed format on disk. If so, what types of compression are available, what are their performance overheads, and are these compression options configurable?
On Tuesday, April 14, 2015 at 1:13:37 PM UTC-4, Adrien Grand wrote:

Hi,

Data are both duplicated to suit different access patterns and compressed. So many compression algorithms are in place that it would be hard to be exhaustive, but we have, for instance, Frame-Of-Reference compression for postings lists, LZ4 for the document store, and bit packing for numeric doc values.

There are no configuration options for compression besides disabling features that you don't need (such as norms on fields that you don't score on). The next major version of Elasticsearch (2.0) will add a setting to enable heavier compression, though (which in practice will use DEFLATE instead of LZ4 for the document store): Add `best_compression` option for indices by rmuir · Pull Request #8863 · elastic/elasticsearch · GitHub
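For reference, a minimal sketch of what enabling that option will look like in 2.0 (the index name here is hypothetical; the codec can only be set at index creation time or on a closed index):

# create an index whose document store uses DEFLATE instead of LZ4
curl -XPUT 'localhost:9200/my_index' -d '
{
  "settings": {
    "index.codec": "best_compression"
  }
}'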
When I loaded nearly 45 million test documents with 3 replicas (each document approx. 2 KB+ in size), I got the following storage info:
health status index       pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_insert   5   3   44985382            0    414.9gb        106.4gb
This suggests there was hardly any compression of the physical storage. Hence my question: how do I find or estimate how much storage would be used for X documents with an average size of Y kilobytes each? From the result above, it appears there is no compression at all on the stored data.
Thanks
Ajay
Compression ratios depend so much on the data that you can't really know what the compression ratio will be without indexing sample documents. However, once you have indexed enough documents (e.g. 100k), you can expect the store size to keep growing linearly with the number of documents.
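As a rough worked example using the figures reported above (approximate, since replica shards can differ slightly in size depending on merge state):

per-doc primary footprint:  106.4 GB / 44,985,382 docs  ≈ 2.5 KB per document
estimate for X similar docs:  pri.store.size ≈ X × 2.5 KB
total including replicas:    store.size ≈ pri.store.size × (1 + number_of_replicas)
sanity check:                106.4 GB × (1 + 3) = 425.6 GB, close to the reported 414.9gb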
Most of the time, the largest part of the index is the document store. In your case I assume that LZ4 is too lightweight a compression algorithm to compress your data efficiently. The high-compression option that is coming in Elasticsearch 2.0 might help.
How much space the data takes up on disk in Elasticsearch depends a lot on your mappings. In addition to storing the source in the _source field, all fields are by default also copied into the _all field to allow free-text search across all fields. Elasticsearch also indexes all the fields in the source document, sometimes in multiple ways, which takes up additional space. The amount of data Elasticsearch needs to store can therefore grow quite a bit before compression is applied.
You might be able to reduce the indexed size on disk by ensuring your
mappings are as efficient as possible, e.g. by disabling the _all field if
you do not need it.
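As an illustrative sketch (the index and type names here are hypothetical), the _all field is disabled per type in the mapping when the index is created:

# create an index whose documents are not copied into the _all field
curl -XPUT 'localhost:9200/my_index' -d '
{
  "mappings": {
    "my_type": {
      "_all": { "enabled": false }
    }
  }
}'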
Best regards,
Christian