Check performance of cluster

I saw that you found my comments on a very similar thread. Let me provide some further details based on the information you provided.

Elasticsearch will use the heap it is assigned, but assigning more heap than necessary will not necessarily improve performance. In addition to the heap, Elasticsearch also stores some data off-heap. The rest of the memory is used by the operating system page cache to cache frequently accessed files. As I pointed out in the linked thread, for use cases with high query concurrency it is important to make sure your full data set fits in the page cache, in order to avoid disk I/O to the greatest extent possible. The JVM heap graphs indicate that you can reduce the heap size, so that is what I would do. Try lowering it from 21GB to e.g. 14GB and see if that makes any difference. Also make sure that Elasticsearch actually has access to all the memory it has been configured with. You do not want to overprovision the host and risk having parts of the memory swapped out to disk.
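To make the memory reasoning concrete, here is a rough back-of-the-envelope sketch in Python. Only the 21GB and 14GB heap sizes come from this thread. The node RAM, the off-heap allowance and the amount of data per node are placeholder values I made up for illustration, so substitute your own. The function name is hypothetical as well.

```python
def page_cache_headroom_gb(node_ram_gb, heap_gb, off_heap_gb, data_on_node_gb):
    """Estimate how much page cache is left after the node's data is cached.

    A negative result means the data set does not fit in the page cache,
    so some queries will likely have to read from disk.
    """
    # Whatever is not taken by the heap or off-heap usage is available
    # to the operating system page cache.
    page_cache_gb = node_ram_gb - heap_gb - off_heap_gb
    return page_cache_gb - data_on_node_gb


# ASSUMED values for illustration only: 32GB RAM per node, 2GB off-heap,
# 12GB of index data on the node. Only the heap sizes are from the thread.
for heap_gb in (21, 14):
    headroom = page_cache_headroom_gb(32, heap_gb, 2, 12)
    print(f"heap={heap_gb}GB -> page cache headroom={headroom}GB")
# prints:
# heap=21GB -> page cache headroom=-3GB
# heap=14GB -> page cache headroom=4GB
```

With these made-up numbers the smaller heap is what lets the data fit in the page cache. Your real figures may look quite different, so treat this as a way to frame the calculation rather than a sizing rule.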

The minimum query latency you can achieve will depend on the shard size, the data and the queries run. If possible, I would recommend trying to reduce the primary shard count to 1 so that a single shard can serve a query all by itself.
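As a minimal sketch of what a single primary shard looks like in practice, the snippet below builds the settings body for a new index. The primary shard count is fixed when an index is created, so this would apply to a new index that you reindex into (or to the result of a shrink), not to a change on the existing index. The helper name and the replica count of 1 are assumptions for illustration.

```python
import json


def single_shard_index_body(number_of_replicas=1):
    """Build an index-creation body with exactly one primary shard.

    The replica count is an ASSUMED default; pick what suits your cluster.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


# This JSON is what you would send as the body of a create-index request.
print(json.dumps(single_shard_index_body(), indent=2))
```

Replicas still let multiple nodes serve queries in parallel, so going to one primary shard does not mean only one node does the work.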

You seem to have a very low level of CPU configured. If you are able to support 400 concurrent queries with this, your queries must be very simple. Elasticsearch sizes threadpools based on how many CPU cores are available, so I would recommend increasing this and allocating a full 8 CPU cores to each data node. I would bump the CPU allocation of the master nodes as well. You do not want them to be starved of CPU when they actually need it. As with memory, make sure Elasticsearch has access to all the CPU resources that are configured and do not overprovision.
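To illustrate why the core count matters, here is a small sketch of how the search thread pool scales with allocated processors. It uses the default formula as I understand it from the thread pool documentation, int((allocated processors * 3) / 2) + 1. Defaults can differ between versions, so please verify against the docs for the version you run. The function name is my own.

```python
def search_thread_pool_size(allocated_processors):
    """Default search thread pool size as documented (ASSUMED to apply to your version)."""
    return int((allocated_processors * 3) / 2) + 1


# Compare a few core counts; 8 is the allocation suggested above.
for cores in (1, 2, 4, 8):
    print(f"{cores} cores -> {search_thread_pool_size(cores)} search threads")
# prints:
# 1 cores -> 2 search threads
# 2 cores -> 4 search threads
# 4 cores -> 7 search threads
# 8 cores -> 13 search threads
```

With only a few search threads per node, most of 400 concurrent queries end up waiting in the queue rather than executing, which adds directly to latency.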