Improve indexing performance speed by routing to a specific shard

Itay_Bittan · February 27, 2023, 8:21am

Hi,

Let's say I have a 100GB of data that need to be indexed into a specific index with 5 shards.
I don't have reads during indexing time and I want to speed up the process as much as possible.
I have 5 (python) workers that can write in parallel to this index in bulks.
I wonder if adding a _routing field with the worker_id (1..5) might help to improve the indexing performance.
Theoretically, I want that each worker will write into a specific shard, assuming those shards are evenly distributed among 5 different nodes. That way, elastic won't have to move documents internally between shards.
Is this intuition is right or is it work differently? am I missing something?

Thanks

Christian_Dahlqvist · February 27, 2023, 8:45am

Is your data immutable?

Itay_Bittan · February 27, 2023, 9:23am

yes it is (I create a new fresh index every day) and after indexing it becomes read-only index

Christian_Dahlqvist · February 27, 2023, 9:36am

It will result in complete bulk requests going to a single shard, and as you have multiple workers all shards will be busy. If the request hits the correct node it may improve performance, but I am not sure how much difference it will make. It should not make performance worse.

I suspect increasing the number of worker threads may make a bigger difference. One indexing thread per shard sounds low.

Itay_Bittan · February 27, 2023, 9:54am

Thanks Christian! I am indexing with more than one thread per shard, I use it as an example to easily explain the other point (the routing question).

My real index has more than that and I have more nodes than that, I just want to understand if using routing can improve the performance (or it's a waste of time).
Actually I have a 500GB indices, indexed once a day (from scratch) and I want to index them as fast as possible.
Each of them has 20 shards and I have 40 data nodes.
I am trying to imagine the internal routing of a single bulk request of 500 docs. Theoretically, the node that received those 500 docs needs to reroute the docs to the right shard on other different nodes. If each one of nodes is busy with routing (additionally to indexing) - it might slow down indexing, isn't that?

thanks again!

Christian_Dahlqvist · February 27, 2023, 10:00am

If you can route bulk requests directly to the node that holds the primary shard all documents within that bulk request will go into you will be able to reduce network traffic, which could lead to better performance if that is something that is indeed a bottleneck.

Itay_Bittan · February 27, 2023, 10:08am

do I have another way for doing that? (routing to the node that holds the primary shard) - rather than using the _routing field?

Christian_Dahlqvist · February 27, 2023, 10:16am

No, there is no efficient way to do that. You will need to identify which shard a specific routing value leads to and which node the primary shard resides on. This can naturally change if Elasticsearch reroutes shard due to events in the cluster.

system · March 27, 2023, 10:17am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Routing performance tuning Elasticsearch	5	1462	July 5, 2017
Does "routing" always improve performance? Elasticsearch	2	717	July 6, 2017
How to increase indexing speed after configuring routing on an index? Elasticsearch	3	335	July 6, 2017
[SOLVED] Customing document routing Elasticsearch	7	795	July 5, 2017
How to route docs of same _routing key in Elasticsearch into multiple shard? Elasticsearch	3	116	April 8, 2024

Improve indexing performance speed by routing to a specific shard

Related topics