Indexing into a red-state cluster

I have a question regarding shard selection during index/bulk operations in Elasticsearch version 6.8.6.

In my cluster, I have three data nodes: A, B, and C. The index has three primary shards and no replicas, allocated evenly: shard 0 on node A, shard 1 on node B, and shard 2 on node C.

I've noticed that when any of the data nodes crashes, index requests are still routed to the target shard on that node, causing a timeout exception. It seems that the IndexRequest routing policy does not exclude the crashed node's shard from the selection process.

I would like to confirm the following:

  1. Is my observation and analysis correct?
  2. Why does the default routing policy include the crashed node in the routing selection? Is it because Elasticsearch aims to maintain consistent routing results?
  3. Is there any way to route the document to an available shard?

I appreciate any insights or suggestions you can provide. Thank you.

Elasticsearch version 6.8.6 is EOL and no longer supported. Please upgrade ASAP.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

As far as I know, Elasticsearch will not allow you to index data into an index in a red state, which will be the case when you have lost a primary shard.

Index operations are assigned to a shard based on a hash of the document ID. As operations could be updates, they cannot necessarily be rerouted to a different shard without causing problems, since the existing copy of a document lives on the shard its ID hashes to.
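Conceptually, the selection looks something like this minimal Python sketch (a simplification, not the exact Elasticsearch implementation, which also accounts for routing partitions and index splitting; it assumes the third-party mmh3 package):

```python
import mmh3  # third-party Murmur3 implementation

def select_shard(routing: str, num_primary_shards: int) -> int:
    # Hash the routing value (the document ID by default) and map it onto
    # the fixed set of primary shards. Python's % is always non-negative
    # for a positive modulus, matching Java's Math.floorMod.
    return mmh3.hash(routing) % num_primary_shards

# With 3 primary shards, each ID deterministically maps to shard 0, 1, or 2,
# regardless of whether that shard is currently assigned to a live node.
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", select_shard(doc_id, 3))
```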

No, not that I am aware of. I would instead recommend configuring a replica shard.
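For reference, here is a minimal sketch of what that looks like with the Python client (assuming elasticsearch-py and a hypothetical index named my-index):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Keep one replica of every primary shard; if the node holding a primary
# goes down, Elasticsearch promotes the replica instead of going red.
es.indices.put_settings(
    index="my-index",
    body={"index": {"number_of_replicas": 1}},
)
```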


Thank you, Christian.

But I have tried it and can confirm that I can index into a red-state index (but only into the online shards, of course).

The primary reason for opting to keep zero replicas is resource cost. We are not concerned if some nodes crash (typically they recover quickly). During a node crash, the only operation affected should be querying the offline shard, which is acceptable in our scenario. What troubles us is that every index request routed to the offline shard waits until it times out, significantly reducing the transactions per second (TPS), particularly for bulk requests: a single index timeout within a bulk request causes the entire operation to wait for a timeout reply.

In conclusion, it appears sensible to select only online shards as the index target when using automatically generated document IDs. As you mentioned, the document ID determines the shard and node selection, so we could explore ways to ensure that automatically generated document IDs map only to online shards.

You can override the document-ID hashing that determines the shard to write to through document routing, which is probably easier than trying to assign appropriate IDs client-side. This also avoids the negative impact that custom IDs have on indexing performance.
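As a rough sketch (again assuming elasticsearch-py against a 6.x cluster and a hypothetical index my-index), the routing value is passed per request and is hashed in place of the document ID:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# The routing value, not the auto-generated _id, determines the target
# shard; documents sharing a routing value land on the same shard.
es.index(
    index="my-index",
    doc_type="_doc",  # mapping types still exist in 6.x
    body={"message": "hello"},
    routing="tenant-42",
)
```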

But you cannot avoid selecting the offline shard by changing the routing or hashing alone.

Maybe we need to do something on the server side instead of the client.

I do not think there is anything you can do. This is why adding replicas is the recommended solution.

The lack of resiliency you have experienced is a trade-off of running without replica shards.

This timeout is under your control via the ?timeout= query parameter to the bulk API. It defaults to 1m, but if you want a shorter (or even zero) timeout then that's fine too.
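For example, a sketch with the Python client (the same setting can be passed as ?timeout=0s on a raw HTTP bulk request; the index name is hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Bulk bodies are alternating action and source lines.
actions = [
    {"index": {"_index": "my-index", "_type": "_doc"}},
    {"message": "hello"},
]

# With timeout="0s", items targeting an unassigned shard should fail
# immediately instead of holding up the whole bulk for the default 1m.
resp = es.bulk(body=actions, timeout="0s")
for item in resp["items"]:
    if "error" in item["index"]:
        print("failed:", item["index"]["error"]["type"])
```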

Thanks David,

I have reduced the default timeout to 10s, which still has a considerable impact on TPS, given that a normal request costs less than 1s. And the situation gets worse with bulk, since a single slow index operation causes the whole bulk to wait until it times out.

It seems I will have to change the default behaviour by rewriting the routing code.

Indeed, we are faced with a trade-off: low cost with a potential risk of data loss, versus the higher cost (at least double) of maintaining replicas. Given the expense of doubling our footprint, we have to opt for the first option. In our environment we can accept a minimal loss of data, and I believe this is a reasonable and fairly common requirement for certain scenarios.

In any case, thank you for your understanding and your replies.

PS: Do you think it would be appropriate to open a feature request issue for this?

Why 10s? If you want these things to fail fast, why not zero?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.