Let's say I have 100 GB of data that needs to be indexed into a specific index with 5 shards.
There are no reads during indexing, and I want to speed up the process as much as possible.
I have 5 (Python) workers that can write to this index in parallel using bulk requests.
I wonder whether adding a _routing value set to the worker_id (1..5) might improve indexing performance.
In theory, I want each worker to write to one specific shard, assuming the shards are evenly distributed across 5 different nodes. That way, Elasticsearch won't have to forward documents internally to other shards.
Is this intuition right, or does it work differently? Am I missing something?
It will result in complete bulk requests going to a single shard, and as you have multiple workers all shards will be busy. If the request hits the correct node it may improve performance, but I am not sure how much difference it will make. It should not make performance worse.
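To make the per-worker routing idea concrete, here is a minimal sketch of a bulk body in which every action carries the same routing value. It uses only the standard library and builds the NDJSON payload by hand. The index name `my-index`, the helper `build_bulk_body`, and the sample documents are made up for illustration. It assumes the action metadata key is `routing`, as in recent Elasticsearch versions (older releases used `_routing`, as far as I know).

```python
import json


def build_bulk_body(index_name, docs, routing):
    """Build an NDJSON bulk body where every action uses the same routing value."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch to index into index_name with this routing.
        action = {"index": {"_index": index_name, "routing": str(routing)}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


worker_id = 3  # hypothetical worker number, used as the routing value
docs = [{"title": f"doc {i}"} for i in range(2)]
body = build_bulk_body("my-index", docs, routing=worker_id)
print(body)
```

Note that routing values are hashed to pick a shard, so five distinct worker IDs are not guaranteed to land on five distinct shards.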
I suspect increasing the number of worker threads may make a bigger difference. One indexing thread per shard sounds low.
Thanks Christian! I am indexing with more than one thread per shard; I only used that as an example to explain the other point (the routing question) more easily.
My real index has more shards than that and I have more nodes than that. I just want to understand whether using routing can improve performance (or whether it's a waste of time).
Actually, I have 500 GB indices that are indexed once a day (from scratch), and I want to index them as fast as possible.
Each of them has 20 shards, and I have 40 data nodes.
I am trying to picture the internal routing of a single bulk request of 500 docs. In theory, the node that receives those 500 docs needs to forward each doc to the right shard, which may live on a different node. If every node is busy with routing (in addition to indexing), that might slow down indexing, right?
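The fan-out described above can be pictured with a small simulation. This is only a sketch: it uses `zlib.crc32` as a stand-in hash, whereas Elasticsearch uses its own hash function internally, so the actual shard numbers will differ. The names `shard_for` and `split_bulk` and the document IDs are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 20  # matches the 20-shard index in the example


def shard_for(key, num_shards=NUM_SHARDS):
    # Stand-in hash for illustration only; not the hash Elasticsearch uses.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def split_bulk(doc_ids, routing=None):
    """Group the documents of one bulk request by target shard."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        # Without explicit routing, the document ID decides the shard.
        key = routing if routing is not None else doc_id
        groups[shard_for(key)].append(doc_id)
    return groups


doc_ids = [f"doc-{i}" for i in range(500)]
unrouted = split_bulk(doc_ids)
routed = split_bulk(doc_ids, routing="worker-3")
print(len(unrouted), "shard groups without routing")
print(len(routed), "shard group with a single routing value")
```

Without routing, one 500-doc bulk request is split into many per-shard pieces that may have to travel to other nodes. With a single routing value, the whole request goes to one shard.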
If you can send each bulk request directly to the node that holds the primary shard into which all documents in that request will be indexed, you will reduce network traffic. That could lead to better performance if network traffic is indeed a bottleneck.
No, there is no efficient way to do that. You will need to identify which shard a specific routing value leads to and which node the primary shard resides on. This can naturally change if Elasticsearch relocates shards due to events in the cluster.
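One way to look up where a routing value lands is the `_search_shards` API called with a `routing` parameter. As far as I know, it returns the shard copies that a request with that routing would hit. The sketch below only parses such a response. The sample dictionary is hand-written for illustration, not real cluster output, and `primary_location` is a hypothetical helper.

```python
def primary_location(search_shards_response):
    """Return (shard_number, node_id) of the first primary copy found, or None."""
    for shard_copies in search_shards_response.get("shards", []):
        for copy in shard_copies:
            if copy.get("primary"):
                return copy["shard"], copy["node"]
    return None


# Hand-written sample shaped like a _search_shards response (assumption, not real output).
sample = {
    "shards": [
        [
            {"index": "my-index", "shard": 7, "primary": False,
             "node": "node-b-id", "state": "STARTED"},
            {"index": "my-index", "shard": 7, "primary": True,
             "node": "node-a-id", "state": "STARTED"},
        ]
    ]
}

print(primary_location(sample))
```

The result can go stale whenever a shard is relocated, so a worker relying on it would have to refresh the lookup regularly.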