Search performance is generally a combination of the search latency achieved and the number of concurrent queries the cluster can support. The latency will depend on e.g. the size of the shards, the type of data and queries used as well as resource constraints. As the number of concurrent queries will increase resource usage, these are linked.
If you have a node with a primary shard (or two nodes each with a primary shard as in your scenario) you can run a benchmark with a single concurrent query and measure the latency the node(s) can support for different types of queries. Once you have established a latency threshold you are willing to accept, you can gradually increase the number of concurrent queries until you see that the latency the cluster can support increases beyond the target. Lets assume you are able to serve X concurrent queries within the expected latency. If you continue to increase the number of concurrent queries, work will be queued up and latency increase further, but the cluster can still continue to serve queries. At some point you will start to see queries getting rejected as they can no longer be queued up.
In order to serve more concurrent queries you can then either scale up the node by adding resources, which alters the capacity of the node and changes the calculation above, or scale out by adding more nodes and start using replica shards. If you add one replica shard on another node (or 2 nodes in your scenario) the latency would not be affected but the cluster should be able to serve close to 2X concurrent queries.
If you want to serve more concurrent queries at the target latency you can continue to scale out or improve the query throughput a single node can handle, e.g. by ensuring the nodes page cache is large enough to hold the full data set, thereby eliminating/reducing disk I/O as a resource constraint.
So, are you suggesting 3-4 replica nodes for the same primary node after the latency is not expected?
To think on broader scale, would search platforms like google and bing would follow such strategy or there are other smarter ways to handle this than redundant data copies (replica nodes) or vertical scaling which i feel is usually not the answer when you know that search hits would increase for sure.
Commercial search engines like Google and Bing are large, complex and highly optimised systems, and I do not think you can compare that to how you scale Elasticsearch.
There is a lot that go into optimizing search performance in Elasticsearch and the example you covered only covers one aspect of this. I would recommend having a look at this section in the docs. To get the best performance from Elasticsearch I would recommend looking into the following:
Optimise how data is stored. This includes document size and structure as well as mappings used and shard size.
Optmise how you query data. This includes writing as efficient queries as possible as well as targeting data efficiently. This may require restructuring the data to better support certain types of queries.
Ensure that cluster resources are used as efficiently as possible. Identify and address bottlenecks. Does it make more sense to scale up or out?