Minimum hardware requirement for 500 million documents

Hi,

My documents have about 15 fields, and I want to index 500 million of them (around 350 GB of data) with Elasticsearch. I will also be querying this data with medium to complex queries, including aggregations.

Can somebody please suggest the minimum hardware configuration I should plan for if I run this as a single-node ES server?

Thanks
Puneet

You can easily store that amount of data on a single node, and the maximum heap you want for a single JVM is about 30 GB, so that the JVM can still use compressed object pointers.
Ideally you should test to see how your data structure and queries behave in Elasticsearch for your use case.
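For what it is worth, here is a minimal sketch of how that heap cap is usually applied on a 1.x single-node install. It assumes you start Elasticsearch with the bundled script, where the ES_HEAP_SIZE variable sets both -Xms and -Xmx; the variable name and script path are the 1.x defaults, so adjust them to your install:

    # Cap the heap at ~30 GB so the JVM keeps compressed object pointers.
    # Leave the rest of the machine's RAM unallocated so the OS can use it
    # as filesystem cache for the Lucene segment files.
    export ES_HEAP_SIZE=30g
    ./bin/elasticsearch

Whatever heap size you settle on, the remaining RAM does more good as operating-system cache than as extra JVM heap.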

Hi,

Thanks for the reply.

Here are some follow-up questions:

I tested with 25 million records and found that I can query my data with 1 GB of allocated heap.
But if I increase the records to 30 million and query the data with the same 1 GB heap, I get the following exception:

    "UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException:  
     [FIELDDATA] Data too large, data for [bad_score] would be larger than limit of  
     [623326003/594.4mb]]; nested: CircuitBreakingException[[FIELDDATA] Data too large, data for 
     [----] would be larger than limit of [623326003/594.4mb]]; }
    at   org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)"
  1. Please let me know in detail what the different strategies are to avoid this error in a production environment. Do we need to keep monitoring the Elasticsearch nodes and scale horizontally as and when required?

  2. Can I assume that if I increase my data to n times 25 million records, I also need to increase my heap to n times 1 GB?

  1. Look at doc values; see the sketch below.
  2. It's more complicated than that, but hard to quantify. You need to test.
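To make point 1 concrete, here is a hedged sketch of a mapping that stores the aggregated field with doc values, so aggregations read from on-disk columnar data instead of loading fielddata onto the heap (which is what trips the circuit breaker above). The index and type names are placeholders, `bad_score` is taken from your error message, and its type is assumed to be numeric; the per-field `doc_values` setting is the 1.x style:

    # Create the index with doc values enabled for the field used in aggregations.
    # Existing data has to be reindexed for the new mapping to take effect.
    curl -XPUT 'http://localhost:9200/myindex' -d '
    {
      "mappings": {
        "mydoc": {
          "properties": {
            "bad_score": { "type": "double", "doc_values": true }
          }
        }
      }
    }'

For the monitoring part of your first question, you can keep an eye on remaining fielddata usage per node with something like `curl 'localhost:9200/_cat/fielddata?v'` and decide from that whether you need more heap or more nodes.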