Configuration for a future cluster with 30b documents

azro352 · July 3, 2019, 8:46pm

I have to design an ES cluster and I'm far from understanding everything. I've read various articles (such as Fred de Villamil's "Designing the Perfect Elasticsearch Cluster") and the ES documentation, but I have no one to discuss this with in person, so here I am to get some advice on my first thoughts.

The purpose is not continuous insertion into the cluster, but rather many huge data sets (most of them inserted at the beginning) that will be queried and updated: some fields will be updated weekly.

The reasons I chose ES were JSON flexibility (each doc won't be updated the same way) and scalability.

Some figures:

- ~30 billion documents (starting at 20b, may reach 40b)
- ~300 B/doc (between 200 and 400 B)
- only one index (is that appropriate?)
- few aggregations, mostly queries (in 'filter' context rather than 'query', since no scoring is needed)

This gives ~9 TB of data

Proposed index config: 512 shards (15GB data/shard)
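To make the arithmetic behind these figures explicit, here is a rough sketch in Python. It assumes decimal units (1 TB = 1000 GB) and counts raw document bytes only, so replicas and on-disk index overhead are not included; the function names are made up for illustration.

```python
# Back-of-the-envelope sizing from the figures above.
# Assumptions: decimal units, raw document bytes only (no replicas, no index overhead).

def raw_size_tb(doc_count: int, bytes_per_doc: int) -> float:
    """Raw data volume in terabytes."""
    return doc_count * bytes_per_doc / 1e12


def shard_size_gb(total_tb: float, shard_count: int) -> float:
    """Average primary shard size in gigabytes."""
    return total_tb * 1000 / shard_count


if __name__ == "__main__":
    for docs in (20_000_000_000, 30_000_000_000, 40_000_000_000):
        total = raw_size_tb(docs, 300)
        per_shard = shard_size_gb(total, 512)
        print(f"{docs // 10**9}b docs: {total:.1f} TB raw, "
              f"{per_shard:.1f} GB per shard at 512 shards")
```

With these assumptions, 30b documents at 300 B each come to about 17.6 GB per shard across 512 shards, a little above the 15GB quoted; 15GB per shard would correspond to roughly 7.7 TB in total.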

Proposed cluster config, one of:

- 1 node, 96-128GB RAM (24-30GB for heap)
- 3 nodes:
  - 1 master
  - 2 data: 64/96/128GB RAM (24-30GB for heap)

I'm here to listen to any constructive advice on the cluster configuration, and maybe some explanation of the real purpose of having 1 master + 2 data nodes instead of 1 master & data node + 2 data nodes?

Thanks

Christian_Dahlqvist · July 4, 2019, 4:59am

Do you have any requirements around high availability? What type of queries will you be using? Do you have any latency requirements for the queries?

azro352 · July 4, 2019, 7:17am

The data has to be available most of the time. There is no precise requirement, but we don't want the index to become unavailable each time we insert or update data because the load is too high.

Queries will have 5-6 criteria using match/exists/regexp. There is no specific latency requirement, but we still need speed here.
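As an illustration of what such a query could look like, here is a minimal sketch that builds a `bool` query with all criteria in `filter` context, so no scoring is computed. The field names and values are hypothetical placeholders, not the real mapping, and `build_filter_query` is just a helper name made up for this example.

```python
import json


def build_filter_query(match_filters, required_fields, regexp_filters):
    """Build a bool query whose clauses all run in filter context (no scoring)."""
    filters = []
    for field, value in match_filters.items():
        filters.append({"match": {field: value}})
    for field in required_fields:
        filters.append({"exists": {"field": field}})
    for field, pattern in regexp_filters.items():
        filters.append({"regexp": {field: {"value": pattern}}})
    return {"query": {"bool": {"filter": filters}}}


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    query = build_filter_query(
        match_filters={"country": "FR", "status": "active"},
        required_fields=["email"],
        regexp_filters={"reference": "ab[0-9]+"},
    )
    print(json.dumps(query, indent=2))
```

The printed JSON can be sent as the body of a `_search` request.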

Also, insertion and update are a big part of the workload: we may insert 200m-500m elements at once, or 100 sets of 10m one after another (I use the bulk API via curl calls).
Updates could cover several million documents at a time, but only change the value of 1-2 fields, also via curl calls.

For inserts and updates, my Python code writes the JSON commands to files and then sends them to ES via curl; it's the quickest way I've found.
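Roughly, the file generation for partial updates looks like the sketch below. It is simplified: the index name, document IDs and field names are placeholders, and it assumes typeless bulk actions (ES 7 style; on older versions a `_type` would also be needed). Each update is an action line followed by a `{"doc": ...}` line, and every line, including the last, must end with a newline.

```python
import json


def bulk_update_lines(index, updates):
    """Yield NDJSON lines for partial updates: an action line, then a doc line."""
    for doc_id, fields in updates:
        yield json.dumps({"update": {"_index": index, "_id": doc_id}})
        yield json.dumps({"doc": fields})


def bulk_body(index, updates):
    """Join the lines into a bulk request body; the final newline is required."""
    return "".join(line + "\n" for line in bulk_update_lines(index, updates))


if __name__ == "__main__":
    # Placeholder index name, IDs and fields, for illustration only.
    body = bulk_body("my_index", [("1", {"status": "inactive"}),
                                  ("2", {"score": 42})])
    with open("bulk_update.ndjson", "w", encoding="utf-8") as fh:
        fh.write(body)
```

The resulting file can then be posted to the `_bulk` endpoint with curl, using `--data-binary @bulk_update.ndjson` and the `Content-Type: application/x-ndjson` header.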

system · August 1, 2019, 7:17am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
