Distribution of Indexing Load across specific Data Nodes for Cold Tier

Hi,

I have 4 data nodes and 1 ingest Node

  • Nodes 1,2,3 are full to 50%, with a capacity ot 10 Terabytes each

  • I attached a fourth data node, Node 4, with an ILM pointing to its storage for cold tier. This fourth node has a very large capacity of 100 Terabytes
    it needs to ingest 50 terabytes of data, through an ingestion node :

  • There is also an ingestion Node, Node 0, that runs logstash for ingestion

My question concerns the computing power (CPU/RAM) used for indexing this ingestion.

I need to state that this is a bulk load of old data to cold directly, hence the question :

Will the computing load for this indexing be distributed across all 4 data nodes' computing power, knowing that the cold storage will not be distributed across nodes and lie on one node only ?

Thanks

It Depends™ but if this indexing workload is sent directly to node 4, and all the relevant shards are also on node 4, then all the processing will be on node 4 too.

1 Like

Exactly. That's the case.
I was hoping the indexing compute was mutualised across resources and dissociated from storage location. I know this kinda breaks the idea of data distribution but in our case it makes sense as this is cold data only and we are not keen to allocate too many resources to it, but for ingestion it will definitely be a bottleneck I guess... right ?

That's right. Each shard is storage and compute combined, so you will need to distribute the data onto other nodes if you want to use their resources. (Except that ingest pipelines run on the receiving node, not the node with the shards, maybe that helps?)

If it were me I think I'd allocate more resources to node 4 for the initial load and then decrease them once complete.

1 Like

I dont even know how would that be possible, as we are running on prem, so I'm not sure about the elasticity of local VM resources in that way (Another argument in favour of cloud-based architectures & solutions I guess.)

TYSM for your help. Truly appreciated. This was an uncertain point for us and shedding light on it will definitely help us move forward.

If you happen to have a link to where this could be documented to support your statements that would be great (I will need to report this to the technical board)

Cheers

I think this is covered in these docs, specifically the bit which says:

Every indexing operation in Elasticsearch is first resolved to a replication group using routing, typically based on the document ID. Once the replication group has been determined, the operation is forwarded internally to the current primary shard of the group.

1 Like