Scaling an ES cluster and balancing shards (primary, replica)

Hi guys, I'm working out how to organize our indices and how many servers we need in production.
I ran an extensive set of measurements under the following circumstances:

  • I have a 6-node test cluster: 1 dedicated client node, 1 dedicated master node and 4 dedicated data nodes.
  • Client: 2 CPU cores, 4 GB RAM | Master: 4 CPU cores, 8 GB RAM | Data nodes: 4 CPU cores, 16 GB RAM

Our estimated production volume is around 5 million documents per year (it may increase in the future).
I tried to index all the documents, but it was quite slow because of the amount of data, so I ran my performance tests on one month's worth of data instead: around 300,000 documents of two types (230,000 and 70,000 respectively).
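
For the indexing part I drive the _bulk REST endpoint roughly like this (a minimal Python sketch using requests; the host address, index name, type name and the bulk size of 1000 are placeholders, not my exact setup):

    import json
    import time

    import requests

    ES = "http://localhost:9200"    # placeholder for my client node address
    INDEX = "monthly-docs"          # placeholder index name
    BULK_SIZE = 1000                # placeholder bulk size

    def bulk_index(docs, doc_type="type_a"):
        """Index docs in bulks of BULK_SIZE and return the average time per bulk request."""
        timings = []
        for start in range(0, len(docs), BULK_SIZE):
            lines = []
            for doc in docs[start:start + BULK_SIZE]:
                # one action line + one source line per document (NDJSON)
                lines.append(json.dumps({"index": {"_index": INDEX, "_type": doc_type}}))
                lines.append(json.dumps(doc))
            body = "\n".join(lines) + "\n"
            t0 = time.time()
            resp = requests.post(f"{ES}/_bulk", data=body.encode("utf-8"),
                                 headers={"Content-Type": "application/x-ndjson"})
            resp.raise_for_status()
            timings.append(time.time() - t0)
        return sum(timings) / len(timings)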

First I used only 3 of my 4 data nodes, so that I'd still have room to scale out by adding another node.
I set up different shard configurations (primaries-replicas): 1-0, 1-1, 1-2, 2-0, 2-1, 2-2 ... 6-0, 6-1, 6-2 (18 different configurations).
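
Each configuration is just a different settings block at index creation time; a minimal sketch (the host address and the perf-test-<p>-<r> index naming are my own placeholders):

    import requests

    ES = "http://localhost:9200"    # placeholder for my client node address

    def create_index(primaries, replicas):
        """Create a fresh test index with the given shard-replica combination."""
        name = f"perf-test-{primaries}-{replicas}"   # my own naming, e.g. perf-test-2-1
        settings = {"settings": {
            "number_of_shards": primaries,      # fixed once the index is created
            "number_of_replicas": replicas,     # can be changed later on a live index
        }}
        requests.put(f"{ES}/{name}", json=settings).raise_for_status()
        return name

    # the 18 combinations: 1-0, 1-1, 1-2, ..., 6-0, 6-1, 6-2
    for p in range(1, 7):
        for r in range(0, 3):
            create_index(p, r)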
I tested both indexing and search time: first I indexed the monthly data for a given shard configuration, then I queried it from JMeter with a set of 7,000 different keywords (real production data) at different concurrency levels (50, 100, 200, 500, 1000 threads).
I measured the average bulk indexing time for a fixed bulk size, and the average HTTP response time reported by JMeter for the queries.
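
Per request, the query side boils down to something like this (single-threaded Python sketch for illustration only; JMeter fires the same kind of request from 50-1000 threads, and the field name "content" plus the simple match query are placeholders for my real queries):

    import time

    import requests

    ES = "http://localhost:9200"    # placeholder; queries go through the client node

    def average_query_time(index, keywords, field="content"):
        """Run one match query per keyword and return the average HTTP response time (seconds)."""
        timings = []
        for keyword in keywords:
            body = {"query": {"match": {field: keyword}}}
            t0 = time.time()
            resp = requests.post(f"{ES}/{index}/_search", json=body)
            resp.raise_for_status()
            timings.append(time.time() - t0)
        return sum(timings) / len(timings)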

What I found is:

  • increasing the number of primary shards drastically increases indexing speed,
  • increasing the number of replica shards only slightly decreases indexing speed.

The strange part is on the query side:

  • increasing the number of primary shards drastically decreases query speed,
  • increasing the number of replica shards also decreases query speed, but not as drastically as the primaries do.

I also tried some extreme configurations with 6 dedicated data nodes: 1 primary / 6 replicas, 2 primaries / 6 replicas, 6 primaries / 1 replica.
The most surprising thing was that the response time for 2-6 was three times higher than for 1-6, which I don't understand.
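
To see where the shard copies actually end up on the nodes in these extreme setups (and whether any replicas stay unassigned), I check the allocation; a quick sketch with a placeholder host:

    import requests

    ES = "http://localhost:9200"    # placeholder

    # columns: index, shard, prirep (p = primary, r = replica), state, docs, store, ip, node
    print(requests.get(f"{ES}/_cat/shards?v").text)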

But this shows me that indexing and query speed depend not only on the number of shards but also on the number of nodes.
So I repeated the whole test with 4 data nodes, which changed only the replica dimension (with 4 nodes I was able to set up to 3 replica shards per index).
The result was the same.
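
Re-running the replica dimension is cheap, because number_of_replicas can be changed on a live index (only the primary count is fixed at creation time); a sketch, again with my placeholder host and index names:

    import requests

    ES = "http://localhost:9200"    # placeholder

    def set_replicas(index, replicas):
        """Change the replica count of an existing index; the primary count cannot be changed this way."""
        resp = requests.put(f"{ES}/{index}/_settings",
                            json={"index": {"number_of_replicas": replicas}})
        resp.raise_for_status()

    # e.g. bump the 2-primary test index from 1 to 3 replicas after adding the 4th data node
    set_replicas("perf-test-2-1", 3)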

Long question short: is it a realistic result that, for my data, the best configuration is 2 primary shards and 1 replica per index (with 3 data nodes)?