Some more things to consider.
It seems you deliberately test on a single machine, partitioning one large piece of hardware into several nodes.
While that is certainly a convenient option, it is not how Elasticsearch is designed to scale. A single ES node can utilize all cores of a machine, and every shard it hosts can draw on all of the node's resources. The main limiting factor when scaling up is the JVM's ability to handle large amounts of memory: today's JVM architecture works best at around 8 GB of heap, and larger heaps increase GC overhead. Running more than one JVM per machine also comes with a slight penalty, because the JVMs compete with each other, Java threads must be mapped onto the OS's native thread model, and the GC overhead adds up.
The preferred method is to add nodes when resources run low; that is, ES scales horizontally, not vertically.
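If you want to verify how the heaps of your partitioned nodes actually look, here is a minimal sketch using the Python Elasticsearch client (the endpoint URL is a placeholder, and response field access assumes a recent client version):

```python
# Sketch: check heap size and usage per node via the nodes stats API,
# to confirm each JVM stays in a reasonable heap range instead of
# running one oversized JVM per machine.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster endpoint

stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    heap = node["jvm"]["mem"]
    max_gb = heap["heap_max_in_bytes"] / 2**30
    used_pct = heap["heap_used_percent"]
    print(f"{node['name']}: heap max {max_gb:.1f} GB, used {used_pct}%")
```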
That said, I suspect the "bottleneck" you observed is also inherent to an architecture where all components are squeezed onto a single machine. Besides that, it is really hard to derive meaningful numbers for the capacity of a single node from such a setup.
My rules of thumb are:
- small hardware, many servers: at least 3 machines in a cluster (1 or 2 do not form a true distributed system)
- every data node must be busy indexing all the time, so each index should have the same number of shards per machine. This ensures that bulk indexing distributes the load evenly over the nodes. ES shard allocation helps here: by default it tries to balance shard counts across nodes
- separation of concerns: client nodes should be remote, not co-located with a data node. This separates the workload of data ingestion from index building (segment merging etc.)
- no replicas while bulk indexing is ongoing, refresh disabled (plus the other recommended bulk settings); see the sketch after this list
- if the client node(s) cannot sustain the required throughput, add another client node or data node, and so on
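For the bulk settings point above, a minimal sketch with the Python client (index name and endpoint are placeholders; parameter names differ slightly between client versions):

```python
# Sketch: disable replicas and refresh before a bulk load, restore them afterwards.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
index = "my-bulk-index"

# Before bulk indexing: no replicas, refresh disabled.
es.indices.put_settings(
    index=index,
    settings={"index": {"number_of_replicas": 0, "refresh_interval": "-1"}},
)

# ... run the bulk load here ...

# Afterwards: restore replicas and refresh interval, then refresh once.
es.indices.put_settings(
    index=index,
    settings={"index": {"number_of_replicas": 1, "refresh_interval": "1s"}},
)
es.indices.refresh(index=index)
```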
A bulk concurrency of 24 is quite high. Whether it is sustainable depends on the client node's CPU core count and on how quickly the cluster can answer; the more data nodes in the cluster, the faster the bulk responses come back. If you really have 24 idle cores on the client node, it may be fine. There is a tradeoff between large bulk requests and a high bulk concurrency rate: the larger a bulk request, the more heap it uses and the longer each request/response takes.
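To experiment with that tradeoff, the Python bulk helpers expose exactly these two knobs. A sketch, with placeholder values and a dummy document generator, not a recommendation:

```python
# Sketch: tune bulk concurrency (thread_count) vs. bulk size (chunk_size).
# Fewer, larger bulks -> more heap per request, longer round trips.
# More, smaller bulks with higher thread_count -> more requests in flight.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions():
    for i in range(100_000):
        yield {"_index": "my-bulk-index", "_source": {"id": i, "value": f"doc-{i}"}}

for ok, info in helpers.parallel_bulk(es, actions(), thread_count=8, chunk_size=1000):
    if not ok:
        print("failed:", info)
```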
Monitoring CPU (system load), disk I/O, and network traffic is essential and gives a good idea of what is going on between client nodes and data nodes.
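A small sketch for pulling those metrics from the nodes stats API (field names follow the standard stats response; endpoint and polling interval are placeholders, and a dedicated monitoring stack would do this better):

```python
# Sketch: poll CPU/load, disk, and transport traffic per node.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

while True:
    stats = es.nodes.stats(metric="os,fs,transport")
    for node in stats["nodes"].values():
        name = node["name"]
        cpu = node["os"]["cpu"]["percent"]
        load_1m = node["os"]["cpu"].get("load_average", {}).get("1m")
        disk_free_gb = node["fs"]["total"]["free_in_bytes"] / 2**30
        tx_mb = node["transport"]["tx_size_in_bytes"] / 2**20
        print(f"{name}: cpu {cpu}%, load1m {load_1m}, "
              f"disk free {disk_free_gb:.0f} GB, tx {tx_mb:.0f} MB")
    time.sleep(10)
```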
By volume dimensions, I mean the calculation that matches the amount of data you want to index per time frame to the available hardware resources. It depends on the size of the data input and the interval/frequency of new data, but also on the power of the cluster (throughput) and the headroom you want to reserve for peak times. Without that calculation, you don't know how many nodes you need.
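A back-of-envelope version of that calculation, with made-up placeholder numbers; plug in your own measurements:

```python
# Sketch: estimate data node count from volume, expansion, headroom, and
# measured per-node bulk throughput. All values below are placeholders.
daily_volume_gb = 500          # raw data to index per day
index_expansion = 1.2          # index size vs. raw size (mapping dependent)
peak_reserve = 0.3             # 30% headroom for peak times
node_throughput_mb_s = 25      # sustained bulk throughput measured per data node

required_mb_s = daily_volume_gb * 1024 * index_expansion * (1 + peak_reserve) / 86_400
nodes_needed = max(3, -(-required_mb_s // node_throughput_mb_s))  # ceil, min 3 nodes
print(f"need ~{required_mb_s:.1f} MB/s sustained -> {int(nodes_needed)} data nodes")
```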
By far, disk I/O is the slowest component of an ES cluster, and the first candidate for a "bottleneck" when it comes to bulk indexing.