Thank you for answering with the output requested; not everyone does, and it is appreciated.
It's not the topic you asked about, but you have not improved your fault tolerance. Before, when you had one node, you had no fault tolerance: if that node was unavailable, you had a problem. That is still the case; that node remains a SPOF, as it's the only master-eligible node. A 2-node cluster, even were you to set both nodes as master-eligible, is not actually any better either, at least not in that sense: electing a master needs a majority of the master-eligible nodes, and with 2 such nodes the loss of either one leaves the survivor without a majority. And I'm not clear what data has been replicated either ...
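If it helps to double-check, the _cat/nodes API lists each node's roles and marks the elected master with a * in the master column (shown here with the same escurl helper described at the end):

# an "m" in node.role means master-eligible; "*" under master is the currently elected master
$ escurl "/_cat/nodes?v&h=name,node.role,master"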
You said you set replicas to 1, but did you do so for all 8000+ indices, or just a few?
Because, assuming all indices have 1 primary shard, your output suggests very few indices have any replica shards:
"active_primary_shards": 8456,
"active_shards": 8494 <-- this should be roughly be 2x the number above if all shards are replicated
In the cluster health output, I notice 8 relocating_shards. Has this ever reached zero?
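You can watch whether shard movement is still in flight via the recovery API (again a sketch; swap in your own curl wrapper):

# list only in-flight recoveries/relocations; empty output means the cluster has settled
$ curl -s "localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node"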
I'm wondering what is really going on here. Has your cluster just "balanced" your 8000+ indices across the 2 nodes, and is that balancing maybe still ongoing, and thus sucking a lot of bandwidth between the 2 nodes? What's the connectivity between the 2 nodes, in terms of available bandwidth and latency?
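Relatedly, recoveries are throttled by default (indices.recovery.max_bytes_per_sec, typically 40mb/s out of the box), so shuffling 8000+ shards can take a long while. To see the limits in effect, something like this should work:

# show recovery-related settings currently in effect, including defaults
$ curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.recovery.*&pretty"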
Do a GET on
/_cat/shards?h=p
and pipe the output into sort | uniq -c
In my tiny 3-node cluster I get:
$ escurl "/_cat/shards?h=p" | sort | uniq -c
6 p
6 r
i.e. I have 6 primary shards and 6 replica shards.
Also
$ escurl "/_cat/shards?h=node" | sort | uniq -c
4 node1
4 node2
4 node3
so those 12 total shards are distributed 4/4/4 across my 3 nodes.
I'd be curious to see your outputs for the same.
(escurl is just a shell alias to the curl command with relevant arguments)
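For anyone wanting the same convenience, a minimal sketch of such a helper, written as a shell function rather than an alias so the path can be appended (host and credentials here are placeholders, adjust to your setup):

# hypothetical escurl equivalent: prefixes the cluster URL and auth onto curl
escurl() {
  curl -s -u "elastic:${ES_PASSWORD}" "https://localhost:9200$1"
}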