Capacity Planning Guidelines? (estimating index size)


(Schnyder) #1

We're kicking off a project that will involve indexing terabytes of data. We're considering using ElasticSearch for the job. However, I need to determine the hardware requirements to hold such a large index.

Are there any guidelines to help estimate the size of an index relative to the size of the source data? For instance, if I index 100MB of new JSON data, how much can I expect ElasticSearch's index to grow as a result?

Any advice would be GREATLY appreciated.

Thanks,
Chris


(Karussell) #2

The hardware requirements also depend on what you want to do with it,
e.g. how much traffic?

As a rule of thumb, I would say that a Lucene index is a bit smaller
than the actual data. BUT it really depends on which parts of the data
should be indexed, whether there are stored fields, whether you use the
_all field or _source, etc. I would suggest setting up a test index with
those 100MB and seeing it in real life.

Also: if you index everything into one index it will get
slower and slower, so maybe set up some index-rolling mechanism (or
play with the shard count) - especially if this is non-static data.
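To make the "index a test sample, then extrapolate" suggestion concrete, here is a minimal sketch of the arithmetic involved. The numbers are invented for illustration, and real index growth is not perfectly linear (segment merges, per-shard overhead, replicas), so treat the result as a rough first estimate only.

```python
def estimate_index_size_gb(sample_source_mb: float,
                           sample_index_mb: float,
                           total_source_tb: float) -> float:
    """Linearly extrapolate on-disk index size from a measured sample.

    sample_source_mb: size of the raw sample data indexed (e.g. 100 MB of JSON)
    sample_index_mb:  resulting on-disk index size for that sample
    total_source_tb:  projected total raw data volume, in terabytes
    Returns the projected index size in gigabytes.
    """
    # Ratio of index size to source size, measured on the sample.
    ratio = sample_index_mb / sample_source_mb
    total_source_gb = total_source_tb * 1024  # TB -> GB
    return total_source_gb * ratio

# e.g. a 100 MB sample produced a 140 MB index; scale to 2 TB of source data:
projected_gb = estimate_index_size_gb(100, 140, 2)
print(f"{projected_gb:.1f} GB")
```

Remember to measure the sample index size after the data has settled (segments merged, refreshes done), and multiply by the replica count if replicas are enabled.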



(Otis Gospodnetić) #3

Hi,

While this won't answer all your questions directly (there is no exact
answer without knowing all the details and, really, without doing some
tests), have a look at the disk & memory size estimator for Lucene/Solr:
http://search-lucene.com/?q=size+estimator&fc_project=Lucene&fc_project=Solr
Parts of it will be applicable to ElasticSearch, but of course
even this estimator is not perfect.

Otis

Check out Search Analytics SaaS - http://sematext.com/search-analytics/index.html



(Shay Banon) #4

You will have to do some capacity tests with a smaller set of data. Some
points to think about:

  1. By default, _source is stored (the actual JSON you added). It usually
    makes sense to turn on compression for it.
  2. _all is enabled by default, meaning that on top of all the specific
    fields being indexed, another field which aggregates all of them is also
    indexed. It makes searching much simpler, but it does add overhead. You can
    disable it completely, or pick and choose in the mappings whether fields
    should be included in _all or not.
  3. You might not need to index all the JSON fields; if there are some that
    you don't need to search on, you can map those with index set to no.
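Putting those three points together, a type mapping could look roughly like the sketch below (the type and field names are invented for illustration; only the `_source`, `_all`, and `index` settings are the ones discussed above):

```json
{
  "mytype": {
    "_source": { "compress": true },
    "_all":    { "enabled": false },
    "properties": {
      "title":   { "type": "string" },
      "payload": { "type": "string", "index": "no" }
    }
  }
}
```

Here `payload` is stored in `_source` but not indexed at all, so it costs no index space and cannot be searched on.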

-shay.banon


