Distribution of Indexing Load across specific Data Nodes for Cold Tier

mehdi-lamrani · June 14, 2024, 6:38am

Hi,

I have 4 data nodes and 1 ingest Node

Nodes 1,2,3 are full to 50%, with a capacity ot 10 Terabytes each
I attached a fourth data node, Node 4, with an ILM pointing to its storage for cold tier. This fourth node has a very large capacity of 100 Terabytes
it needs to ingest 50 terabytes of data, through an ingestion node :
There is also an ingestion Node, Node 0, that runs logstash for ingestion

My question concerns the computing power (CPU/RAM) used for indexing this ingestion.

I need to state that this is a bulk load of old data to cold directly, hence the question :

Will the computing load for this indexing be distributed across all 4 data nodes' computing power, knowing that the cold storage will not be distributed across nodes and lie on one node only ?

Thanks

DavidTurner · June 14, 2024, 7:09am

It Depends™ but if this indexing workload is sent directly to node 4, and all the relevant shards are also on node 4, then all the processing will be on node 4 too.

mehdi-lamrani · June 14, 2024, 7:20am

Exactly. That's the case.
I was hoping the indexing compute was mutualised across resources and dissociated from storage location. I know this kinda breaks the idea of data distribution but in our case it makes sense as this is cold data only and we are not keen to allocate too many resources to it, but for ingestion it will definitely be a bottleneck I guess... right ?

DavidTurner · June 14, 2024, 7:24am

That's right. Each shard is storage and compute combined, so you will need to distribute the data onto other nodes if you want to use their resources. (Except that ingest pipelines run on the receiving node, not the node with the shards, maybe that helps?)

If it were me I think I'd allocate more resources to node 4 for the initial load and then decrease them once complete.

mehdi-lamrani · June 14, 2024, 7:49am

I dont even know how would that be possible, as we are running on prem, so I'm not sure about the elasticity of local VM resources in that way (Another argument in favour of cloud-based architectures & solutions I guess.)

TYSM for your help. Truly appreciated. This was an uncertain point for us and shedding light on it will definitely help us move forward.

If you happen to have a link to where this could be documented to support your statements that would be great (I will need to report this to the technical board)

Cheers

DavidTurner · June 14, 2024, 8:03am

I think this is covered in these docs, specifically the bit which says:

Every indexing operation in Elasticsearch is first resolved to a replication group using routing, typically based on the document ID. Once the replication group has been determined, the operation is forwarded internally to the current primary shard of the group.

Topic		Replies	Views
Load unfairly distributed during large ingest Elasticsearch	6	630	May 8, 2017
Logstash and Ingest Nodes Elasticsearch ingest-pipeline	9	481	February 8, 2021
Ingest only nodes Elasticsearch	4	587	July 4, 2020
Regarding Cluster Sizing Elasticsearch	6	386	May 4, 2020
Ingestion from ES-Spark to Ingestion node Elasticsearch es-hadoop	8	654	October 9, 2020

Distribution of Indexing Load across specific Data Nodes for Cold Tier

Related topics