What's the best production setup for handling 1 billion records?

Hi,

I want to load 1 billion documents into Elasticsearch.
I am using ES 1.7.1.
Format of each document is

Is it fine to load all the data into a single ES node with all default settings?

It depends on a few things. You really should try it to see if it fits on a node of whatever size you have.

I am running ES with ES_HEAP_SIZE=16g, and disk space is 500G.
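
As a rough sanity check on whether the data can fit, the disk figure above can be turned into a per-document budget. This is only a back-of-the-envelope sketch: it assumes 500G means 500 decimal gigabytes, and the class and method names are made up for illustration. The real on-disk size depends on the mappings and document format, and some free space has to stay available for segment merges, so the usable budget is lower than the number printed here.

```java
// Hypothetical helper: divides the available disk by the number of documents.
class DiskBudgetSketch {

    // Integer division is enough for a rough estimate.
    static long bytesPerDocument(long diskBytes, long documentCount) {
        return diskBytes / documentCount;
    }

    public static void main(String[] args) {
        long diskBytes = 500L * 1_000_000_000L; // assumption: 500G = 500 * 10^9 bytes
        long documentCount = 1_000_000_000L;    // 1 billion documents
        // Everything stored for a document (source, index structures) shares this budget.
        System.out.println(bytesPerDocument(diskBytes, documentCount) + " bytes per document");
    }
}
```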

Have you tried loading the data into a node with those specs?
What was the indexing and query response like?

I have made one change to the settings: I have set index.number_of_shards: 1 and index.number_of_replicas: 0.

A query takes ~15 sec, and CPU utilization is around 400%.

What sort of query is it?

I am using spring-data-elasticsearch 1.3.0.RELEASE

Most of the queries are:
@Query("{\"bool\" : {\"should\" : [{\"query_string\" : {\"query\" : \"?0\", \"fields\" : [\"dname\"]}}, {\"query_string\" : {\"query\" : \"?1\", \"fields\" : [\"dname\"]}}]}}")

For example, return all documents where dname="abc.com" or dname="abc.com*".

The other kind is a logical AND on the pId and sid combination.

For example, return all documents where pid=X AND sid=Y.

Can anyone help me with this?

A leading wildcard query like that will always be slow; it's essentially the ES version of a table scan.
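
To illustrate the table-scan comparison, here is a minimal sketch in plain Java. It uses a sorted `TreeSet` as a stand-in for an index's sorted term dictionary; that is a simplification, not how ES stores terms internally, and the class and method names are made up for illustration. With a trailing wildcard the lookup can jump to the prefix and stop early, while with a leading wildcard every term has to be checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: a sorted set of terms standing in for a term dictionary.
class WildcardCostSketch {

    // Trailing wildcard ("abc.com*"): seek to the prefix, stop at the first non-match.
    static List<String> prefixMatches(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so nothing after this can match
            }
            hits.add(term);
        }
        return hits;
    }

    // Leading wildcard ("*abc.com"): no seek is possible, so every term is examined.
    static List<String> suffixMatches(TreeSet<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "abc.com", "abc.com.au", "abd.com", "mail.abc.com", "xyz.org"));
        System.out.println(prefixMatches(terms, "abc.com")); // [abc.com, abc.com.au]
        System.out.println(suffixMatches(terms, "abc.com")); // [abc.com, mail.abc.com]
    }
}
```

Note that a trailing wildcard is not free either: a short prefix over a billion documents can still match a very large number of terms.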

Hi @warkolm,

My query does not start with a leading wildcard. It's a trailing wildcard query.

The ? is leading though. As per https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-query-string-query.html#_wildcards it'll only look for a single char in there, but it still needs to scan a lot of docs to return just those.

I am using spring-data-elasticsearch.

Below is the code snippet.

@Query("{\"bool\" : {\"should\" : [{\"query_string\" : {\"query\" : \"?0\", \"fields\" : [\"dname\"]}}, {\"query_string\" : {\"query\" : \"?1\", \"fields\" : [\"dname\"]}}]}}")
Page findByDomainOrDomainStartsWith(String domain, String domain1, Pageable pageable);

"?0" -> domain
"?1" -> domain1
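
To make the placeholder mapping concrete, here is a small standalone sketch that mimics the substitution with plain string replacement. It is not the spring-data-elasticsearch implementation, the class and method names are hypothetical, and it does no JSON escaping of the arguments. It only shows that `?0` and `?1` are positional parameters rather than wildcard characters, and what the query body looks like once they are filled in.

```java
// Illustrative only: positional placeholder substitution via String.replace.
class PlaceholderSketch {

    // Same query body as in the @Query annotation, with ?0 and ?1 placeholders.
    static final String TEMPLATE =
            "{\"bool\" : {\"should\" : ["
            + "{\"query_string\" : {\"query\" : \"?0\", \"fields\" : [\"dname\"]}}, "
            + "{\"query_string\" : {\"query\" : \"?1\", \"fields\" : [\"dname\"]}}]}}";

    // Replace each "?N" with the N-th argument. Going from the highest index down
    // keeps "?1" from being substituted inside a longer placeholder such as "?10".
    static String bind(String template, String... args) {
        String result = template;
        for (int i = args.length - 1; i >= 0; i--) {
            result = result.replace("?" + i, args[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // domain = "abc.com", domain1 = "abc.com*" (a trailing wildcard)
        System.out.println(bind(TEMPLATE, "abc.com", "abc.com*"));
    }
}
```

With these arguments, the first clause searches dname for abc.com and the second for abc.com*, so the only wildcard that reaches ES is the trailing `*`.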

Ah ok, my misunderstanding then. Sorry!