We're kicking off a project that will involve indexing terabytes of data. We're considering using Elasticsearch for the job. However, I need to determine the hardware requirements to hold such a large index.
Are there any guidelines to help estimate the size of an index relative to the size of the source data? For instance, if I index 100MB of new JSON data, how much can I expect Elasticsearch's index to grow as a result?
The hardware requirements also depend on what you want to do with the cluster,
e.g. how much search traffic it will serve.
As a rule of thumb I would say that a Lucene index is a bit smaller
than the actual data. But it really depends on which parts of the data
are indexed, whether there are stored fields, whether you use the _all
field or _source, etc. I would suggest setting up a test index with
those 100MB and measuring it in real life.
Also: if you keep indexing everything into a single index it will get
slower and slower, so you may want to set up some index rolling mechanism (or
play with the shard count) - especially if this is non-static data.
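For instance, a rough way to run that test (a minimal sketch, assuming a local node on localhost:9200 and a 1.x-era REST API; the index name test_sizing and the file sample.json with one JSON document per line are made up for illustration, and the exact endpoints and response shape vary by Elasticsearch version):

    # Index a ~100MB sample and compare the reported store size with the source size.
    import json
    import requests

    ES = "http://localhost:9200"
    INDEX = "test_sizing"

    # Create a small test index; the shard/replica counts are only example values to play with.
    requests.put(ES + "/" + INDEX,
                 headers={"Content-Type": "application/json"},
                 data=json.dumps({"settings": {"number_of_shards": 2,
                                               "number_of_replicas": 0}}))

    # Index the sample, one document per request for simplicity
    # (the bulk API would be faster, but this keeps the sketch short).
    with open("sample.json") as f:
        for i, line in enumerate(f):
            requests.put("%s/%s/doc/%d" % (ES, INDEX, i),
                         headers={"Content-Type": "application/json"},
                         data=line)

    # Refresh so everything is flushed and counted, then read the on-disk store size.
    requests.post(ES + "/" + INDEX + "/_refresh")
    stats = requests.get(ES + "/" + INDEX + "/_stats/store").json()
    print(stats["indices"][INDEX]["total"]["store"]["size_in_bytes"])

Dividing the reported size_in_bytes by the size of sample.json gives the expansion ratio you can extrapolate from, keeping in mind that replicas multiply the total size on disk.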
While this won't answer all your questions directly (there is no exact
answer without knowing all the details and, really, without doing some
tests), have a look at the disk & memory size estimator for Lucene/
Solr - http://search-lucene.com/?q=size+estimator&fc_project=Lucene&fc_project=Solr
. Parts of it will be applicable to Elasticsearch, but of course
even this estimator is not perfect.
You will have to do some capacity tests with a smaller set of data. Some
points to think about (a sketch of a mapping that applies them follows below):
By default, _source is stored (the actual JSON you added). It usually
makes sense to turn on compression for it.
_all is enabled by default, meaning that on top of all the specific
fields being indexed, another field which aggregates all of them is also
indexed. It makes searching much simpler, but it does add overhead. You can
disable it completely, or pick and choose in the mappings which fields should be
included in _all.
You might not need to index all the JSON fields; if there are some that
you don't need to search on, you can map those with index set to no.
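To make those points concrete, here is a minimal sketch of a mapping that applies all three. The index, type, and field names are made up, and _source compression and the _all field only exist in older Elasticsearch versions, so treat it as an illustration rather than a copy-paste recipe:

    import json
    import requests

    mapping = {
        "mappings": {
            "event": {
                # Keep the raw JSON, but compressed on disk (older ES versions only).
                "_source": {"enabled": True, "compress": True},
                # Skip the aggregated _all field entirely.
                "_all": {"enabled": False},
                "properties": {
                    # Indexed and searchable.
                    "message": {"type": "string"},
                    # Kept in _source (so it comes back in search hits) but not indexed.
                    "raw_payload": {"type": "string", "index": "no"}
                }
            }
        }
    }

    requests.put("http://localhost:9200/test_sizing",
                 headers={"Content-Type": "application/json"},
                 data=json.dumps(mapping))

With index set to no, the field is still available in _source, it just can't be searched on, which is often the cheapest way to shrink the index for large payload fields.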