As @Christian_Dahlqvist mentioned, typically it's I/O that takes a bigger hit than CPU for both ingest and search.
With ingestion, once you are certain that the load driver isn't the bottleneck, apart from tweaking the number of clients you can experiment with increasing the bulk_size (which defaults to 5000 for the geopoint track you are currently using). Higher throughput (and load) can also be achieved by increasing the number of primary shards; on a recent 3-node setup with another track, while investigating modest indexing throughput, we got noticeably better results by increasing the number of primary shards. See the sketch below for how to override these settings.
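With Rally you can override such settings via `--track-params` on the command line. A minimal sketch; the exact parameter names (`bulk_size`, `bulk_indexing_clients`, `number_of_shards`) are assumptions that depend on what the track exposes, so check the track's README first:

```sh
# Sketch: benchmark an existing cluster, overriding track parameters.
# The parameter names assume the track exposes them; adjust to your track.
esrally --track=geopoint \
        --target-hosts=127.0.0.1:9200 \
        --pipeline=benchmark-only \
        --track-params="bulk_size:10000,bulk_indexing_clients:8,number_of_shards:5"
```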
Note that the geopoint track you are using has a rather small corpus; among the default tracks with pre-generated data you could try the nyc_taxis track, which has a large corpus and uses a bigger bulk size by default.
Finally, depending on the use case, the default compression can be changed to best_compression (see the index.codec setting); this is, for example, what the eventdata track uses. Best compression trades higher CPU usage for lower disk usage, which can be attractive for logging use cases where disk costs need to be kept under control.
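If you want to try this outside of a track, note that index.codec can only be set when an index is created (or while it is closed). A minimal sketch, using a hypothetical index name:

```sh
# Sketch: create an index with best_compression enabled.
# "my-logs" is just an example name; index.codec cannot be
# changed on an open index.
curl -XPUT "http://localhost:9200/my-logs" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.codec": "best_compression"
  }
}'
```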
Many of the common pitfalls and topics touched on here are discussed in the links I mentioned in my previous comment.