Controlling primary/replica shard allocation

We recently upgraded Elasticsearch from 1.5 to 2.3.4.

For most of our indexes, we have 1 shard per node, with replicas set to 1, which gives 2 shard copies per index per node. Since we moved from 1.5 to 2.3, the cluster seems more inclined to allocate 2 primaries on one node and 2 replicas on another, instead of the usual 1 primary and 1 replica on each node (it is unclear whether the earlier behavior was just luck).

According to the documentation, this should not be a problem, since primaries and replicas do equal amounts of work. However, during bulk indexing it seems to create an uneven workload that slows down the overall indexing process a bit.

During bulk indexing, we send an equal number of documents to each node. We have observed on multiple clusters of various sizes that the indexing rate (proportional to index_total / index_time_in_millis in node stats) is lower on nodes that hold 2 primary shards than on nodes that hold 2 replica shards.

Consider this:

curl localhost:9200/_cat/shards/eflow_2016_11_03 | sort -k 8,8
eflow_2016_11_03 0 p STARTED 89807695 39.8gb 192.168.1.1 cdh-1
eflow_2016_11_03 2 p STARTED 89820639 39gb 192.168.1.1 cdh-1
eflow_2016_11_03 0 r STARTED 89807695 38.9gb 192.168.1.2 cdh-2
eflow_2016_11_03 1 p STARTED 89812472 44.1gb 192.168.1.2 cdh-2
eflow_2016_11_03 1 r STARTED 89812472 38.9gb 192.168.1.3 cdh-3
eflow_2016_11_03 2 r STARTED 89820639 38.9gb 192.168.1.3 cdh-3
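
To make the imbalance in the listing above easier to spot, here is a minimal Python sketch that tallies primaries and replicas per node from the plain-text _cat/shards output. The helper name copies_per_node is hypothetical, and the sketch assumes the default column order shown above (index, shard, prirep, state, docs, store, ip, node) with no header row.

```python
from collections import Counter, defaultdict

# Output of the _cat/shards call above, pasted verbatim.
CAT_SHARDS = """\
eflow_2016_11_03 0 p STARTED 89807695 39.8gb 192.168.1.1 cdh-1
eflow_2016_11_03 2 p STARTED 89820639 39gb 192.168.1.1 cdh-1
eflow_2016_11_03 0 r STARTED 89807695 38.9gb 192.168.1.2 cdh-2
eflow_2016_11_03 1 p STARTED 89812472 44.1gb 192.168.1.2 cdh-2
eflow_2016_11_03 1 r STARTED 89812472 38.9gb 192.168.1.3 cdh-3
eflow_2016_11_03 2 r STARTED 89820639 38.9gb 192.168.1.3 cdh-3
"""


def copies_per_node(cat_output):
    """Return {node: Counter({'p': primaries, 'r': replicas})}.

    Hypothetical helper: assumes the default 8-column layout with no header.
    """
    counts = defaultdict(Counter)
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # blank line, or a shard copy with no node assigned
        prirep, node = fields[2], fields[7]
        counts[node][prirep] += 1
    return counts


counts = copies_per_node(CAT_SHARDS)
for node in sorted(counts):
    print(node, "primaries:", counts[node]["p"], "replicas:", counts[node]["r"])
```

For this index it reports 2 primaries on cdh-1, 1 primary and 1 replica on cdh-2, and 2 replicas on cdh-3.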

Per-node indexing rates, computed as index_total / (index_time_in_millis / 1000):

Host   cores  docs/s
cdh-1  32     10230.6
cdh-2  32     13644.1
cdh-3  32     18289.6
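
For anyone who wants to reproduce these figures, here is a rough Python sketch of the calculation. It assumes the node stats response has already been fetched and parsed into a dict shaped like nodes.<id>.indices.indexing; the function name indexing_rates and the SAMPLE numbers are made up for illustration and are not our measurements.

```python
def indexing_rates(node_stats):
    """Return {node name: docs/s} as index_total / (index_time_in_millis / 1000).

    Hypothetical helper: expects the parsed JSON of a node stats response.
    """
    rates = {}
    for node in node_stats["nodes"].values():
        indexing = node["indices"]["indexing"]
        seconds = indexing["index_time_in_millis"] / 1000.0
        # Guard against a node that has not indexed anything yet.
        rates[node["name"]] = indexing["index_total"] / seconds if seconds else 0.0
    return rates


# Illustrative values only, chosen to make the arithmetic easy to follow.
SAMPLE = {
    "nodes": {
        "id-a": {
            "name": "node-a",
            "indices": {"indexing": {"index_total": 1000000, "index_time_in_millis": 100000}},
        },
        "id-b": {
            "name": "node-b",
            "indices": {"indexing": {"index_total": 1500000, "index_time_in_millis": 100000}},
        },
    }
}

rates = indexing_rates(SAMPLE)
for name in sorted(rates):
    print(name, round(rates[name], 1))
```

These counters accumulate from node start, so taking the difference between two snapshots around a bulk run is probably a fairer comparison than the raw totals.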

This discrepancy makes the total indexing time longer than when every node holds both a primary and a replica. There does not seem to be any configuration setting that controls allocation in that way. Manually reallocating shards can produce the balanced layout, which seems to result in faster indexing, but doing this by hand is not usable in production.

Our clusters index a lot of documents 24/7 and we would like to get all the indexing speed we can (we have already done the other things that improve indexing speed: configuration, hardware, mapping optimization, etc.). Any suggestions?