How data will be distributed among the data nodes

NK2812 · December 29, 2021, 9:51am

Hi,
I have cluster with 3 master nodes and 3 data nodes and when I am trying to add the documents/data to cluster it seems only 2 data nodes are used.

I have only one index and it is same for all the data nodes.

if you look at the below API response, only es-data-1 and es-data-2 nodes are used, data is not inserting to es-data-0.

I want to understand how the data distribution happens and when my es-data-0 node will be used to store the data. Please help me on the same.

https://abc.xyz.com/cluster-name/_cat/allocation?v=true&format=json

[

    {

        "shards": "0",

        "disk.indices": "0b",

        "disk.used": "18.5mb",

        "disk.avail": "957.3mb",

        "disk.total": "975.8mb",

        "disk.percent": "1",

        "host": "1.x.x.x",

        "ip": "1.x.x.x",

        "node": "es-data-0"

    },

    {

        "shards": "1",

        "disk.indices": "360.4mb",

        "disk.used": "379mb",

        "disk.avail": "596.8mb",

        "disk.total": "975.8mb",

        "disk.percent": "38",

        "host": "1.x.x.x",

        "ip": "1.x.x.x",

        "node": "es-data-2"

    },

    {

        "shards": "1",

        "disk.indices": "360.2mb",

        "disk.used": "378.9mb",

        "disk.avail": "596.9mb",

        "disk.total": "975.8mb",

        "disk.percent": "38",

        "host": "1.x.x.x",

        "ip": "1.x.x.x",

        "node": "es-data-1"

    }

]

HenningAndersen · December 29, 2021, 10:23am

Hi @NK2812 ,

Elasticsearch distributes data in "shards". By default, each index has one primary shard and one replica shard. This is why you see only two nodes used.

If you expect to use more than one index, that configuration may be just fine, since additional indices will then ensure that all data nodes are used.

If you have just one index (or primarily use one index), you can set number_of_shards when creating the index.

Notice that if you use data-streams, more indices (and thus shards) will be created on rollover.

If you prefer additional replicas of data, you can also adjust number_of_replicas instead. This will allow searches to go to all of the data nodes.

Hope this helps.

NK2812 · December 29, 2021, 11:07am

Hi @HenningAndersen

Thanks for your reply.

So the node es-data-0 has different shard "shards": "0" than es-data-1 and es-data-2 "shards": "1".
after es-data-1 shard is full then will it use the es-data-0 shard ?

My use case is, I want to increase the data nodes when storage used is 80% or more. Lets say for example when I created cluster initially I will have 2 data nodes and 3 master nodes. Data nodes has the storage capacity of 1GB, so when consumer added data and storage used is 80%, I would like to increase the data node so that I have 1GB storage is added to the cluster. I am just increasing the data node and changing any other configuration, so index is still same.

I am creating cluster using ECK operator and when storage is 80% I want just update replica count of data node configuration and additional data node is added to cluster.

HenningAndersen · December 29, 2021, 6:48pm

Hi @NK2812

I think we need a bit more background to properly help you design this. What kind of data are you trying to index and is this time-based data like logs and metrics? If so, you should use data streams and let ILM rollover when the size of indices reach a threshold.

If this is more of a regular search case with just one index with all data, there is a bit more planning involved in order to ensure that you can grow the cluster with the data. Ideally, you should establish a max size then and plan for a number of shards to hit around 50GB per shard for your max volume (see also the recommendations here).

NK2812 · December 30, 2021, 6:29am

Hi @HenningAndersen

This is for regular search with just one index with all the data. By default we will be creating cluster with 2 data nodes with each node 5GB of the storage. Once the cluster is ready consumer keep adding the data in to cluster and once storage is full service becomes unhealthy and throws 503.
So I want to increase the storage as when it reaches some threshold so that service will be available and end consumer can use the cluster.

I was thinking to increase the data pods with 5GB of storage and this process keep happens when storage meets threshold and storage is 50GB including all the data pods. I do not want to allocate 50GB storage initially when I create cluster because we do not whether consumer actually needs 50GB while cluster creation. And we want increase the storage only when consumer used some 70 or 80 % of the current storage.

Please suggest what is the best way to increase storage dynamically.

HenningAndersen · December 30, 2021, 8:14am

Hi @NK2812 ,

if your target is in GBs, it seems likely that a 2 data node setup can handle this. Rather than scale out horizontally, I would advice to scale up vertically by expanding the volumes (or replacing nodes with bigger volumes in a rolling fashion, I think you need to rename the node-set in ECK to do this). Nodes with only 5 or 50GB disk are quite small.

NK2812 · December 30, 2021, 11:37am

Thanks for your advice @HenningAndersen .

One question just to understand, lets say I have one index and two data nodes with 1GB storage, so default it will have 1 primary shard on node-1 and its replica on node-2. so if I add one more data node which is node-3 after sometimes, so index still using first 2 nodes which created earlier.

When will be node-3 used ? is it like as index has only two shards created by default on first 2 nodes and it will never use node-3 ?

leandrojmp · December 30, 2021, 1:13pm

If your index has just 1 primary shard and 1 replica, then it will only use 2 data nodes, if you add more data nodes they will be used when you create new indices as elastic try to balance the number of shards between the nodes, or if you increase the number of replicas.

HenningAndersen · December 30, 2021, 5:09pm

With 1 index, 1 primary and 1 replica you can only use 2 data nodes. You can increase that by changing number_of_shards on the index. The problem is to what number and for that we need a bit of information to help guide you.

Can you provide a rough estimate of the amount of data you will need in the cluster? Is it 10GB, 100GB, 1TB, 10TB, 100TB?
Is there a max disk size per node that you want to use due to the underlying hardware?

This page has information on desired shard sizes.

system · January 27, 2022, 5:09pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How is ES data distributed in each data nodes Elasticsearch	2	319	June 2, 2021
3 Elasticsearch node cluster Elasticsearch	9	1032	July 6, 2017
How does ES stores the data? Elasticsearch	4	219	May 31, 2022
Elasticsearch cluster shards allocation not even Elasticsearch	8	2615	April 3, 2019
Data distribute in large cluster with many indices Elasticsearch	4	616	September 23, 2017

How data will be distributed among the data nodes

Related topics