Configuration for a future cluster with 30b documents

I have to design an ES cluster and I'm far from understanding everything. I've read different articles (like Fred de Villamil's "Designing the Perfect Elasticsearch Cluster") and the ES documentation, but I have no IRL contact to discuss this with, so here I am to get some advice about my first thoughts.


The purpose is not continuous insertion into the cluster, but many huge sets of data (most of them inserted at the beginning) to be queried and updated: some fields will be updated weekly.

The reasons why I chose ES were: JSON flexibility (each doc won't be updated the same way) and scalability.

Info about the data:

  1. ~30 billion documents (starting at 20b, may reach 40b)
  2. ~300 B/doc, between 200 and 400 B
  3. only one index (is that appropriate?)
  4. few aggregations, mostly queries (in 'filter' context rather than 'query' context, as there is no need for scoring)

This gives ~9 TB of data

Planned index config: 512 shards (~15 GB of data per shard)
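
As a rough check of these numbers, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate under my own assumptions: decimal units, primary data only, and no allowance for replicas or index overhead. The function name `estimate` is just illustrative.

```python
# Back-of-the-envelope sizing: raw document bytes only.
# Assumption: replicas, mappings and on-disk index overhead are ignored.
def estimate(docs, bytes_per_doc, shards):
    total_bytes = docs * bytes_per_doc
    total_tb = total_bytes / 1e12              # decimal terabytes
    gb_per_shard = total_bytes / shards / 1e9  # decimal gigabytes per shard
    return total_tb, gb_per_shard

# Start, expected and maximum document counts from the list above.
for docs in (20_000_000_000, 30_000_000_000, 40_000_000_000):
    total_tb, gb_per_shard = estimate(docs, 300, 512)
    print(f"{docs / 1e9:.0f}b docs: {total_tb:.1f} TB, {gb_per_shard:.1f} GB/shard")
```

With 512 shards and 300 B/doc, this gives roughly 12 GB per shard at 20b documents, 18 GB at 30b and 23 GB at 40b, so the 15 GB figure sits between the starting and expected volumes.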

Possible cluster configs (one or the other):

  1. 1 node, 96-128 GB RAM (24-30 GB for heap)

  2. 3 nodes:

    • 1 master
    • 2 data: 64/96/128 GB RAM (24-30 GB for heap)

I'm here to listen to any constructive advice on the cluster configuration, and maybe some explanation of the real purpose of having 1 master + 2 data nodes instead of 1 master & data + 2 data nodes?

Thanks

Do you have any requirements around high availability? What type of queries will you be using? Do you have any latency requirements for the queries?

The data has to be available most of the time. There is no precise requirement, but we don't want the index to be unavailable each time we insert or update data, because that would be too much downtime.

Queries will have something like 5-6 criteria using match/exists/regexp. There is no specific latency requirement, but we still need speed here.
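
To make the query shape concrete, here is a minimal sketch of such a request body with all clauses in filter context (so no scoring). The field names `country`, `email` and `name`, and the helper `build_filter_query`, are made up for illustration; the real criteria would differ.

```python
import json

def build_filter_query(country, name_pattern):
    """Build a search body whose clauses all run in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Hypothetical fields, only here to show the three clause types.
                    {"match": {"country": country}},
                    {"exists": {"field": "email"}},
                    {"regexp": {"name": {"value": name_pattern}}},
                ]
            }
        }
    }

# The printed JSON is what would be sent to the index's _search endpoint.
print(json.dumps(build_filter_query("FR", "dup.*"), indent=2))
```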

Also, insertion and update are a big part, as we can insert 200m-500m elements at once, or 100 sets of 10m one after another (I use the bulk API with curl calls).
For updates, it could be several million documents at a time, but only changing the value of 1-2 fields, also with curl calls.

For inserts and updates, my Python code writes the JSON commands to files and then sends them to ES via curl; it's the quickest way I've found.
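
For reference, the file generation looks roughly like the sketch below. It is simplified: the index name, document IDs, field names and the helpers `bulk_lines` and `write_bulk_file` are placeholders, not my real code. Each action is one metadata line followed by one payload line, in the newline-delimited JSON format the bulk API expects.

```python
import json

def bulk_lines(index, new_docs, partial_updates):
    """Yield newline-delimited JSON lines for the _bulk endpoint.

    new_docs: iterable of (doc_id, source_dict) to index.
    partial_updates: iterable of (doc_id, changed_fields_dict) to update.
    """
    for doc_id, source in new_docs:
        yield json.dumps({"index": {"_index": index, "_id": doc_id}})
        yield json.dumps(source)
    for doc_id, fields in partial_updates:
        yield json.dumps({"update": {"_index": index, "_id": doc_id}})
        # "doc" carries only the 1-2 fields that change.
        yield json.dumps({"doc": fields})

def write_bulk_file(path, index, new_docs, partial_updates):
    """Write the bulk lines to a file; every line ends with a newline."""
    with open(path, "w", encoding="utf-8") as f:
        for line in bulk_lines(index, new_docs, partial_updates):
            f.write(line + "\n")

if __name__ == "__main__":
    # Placeholder data, only to show the call.
    write_bulk_file(
        "bulk_0001.ndjson",
        "my_index",
        new_docs=[("1", {"name": "a", "status": "new"})],
        partial_updates=[("2", {"status": "checked"})],
    )
```

The resulting file is then posted to the `_bulk` endpoint with curl, using `--data-binary` and the `application/x-ndjson` content type so the newlines are preserved.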
