Huge performance degradation during bulk indexing

Yes, that is indeed interesting. I still suspect I/O, although it may not be your I/O. Still, it is also interesting that these spikes seem to be correlated across nodes.

Not particularly, I just wanted to check that you'd done some measurements here. It seems that you have.

Ah ok, you're using AWS Elasticsearch, hence the difficulty in getting iowait stats. r4.xlarge instances only have EBS volumes - can you see any metrics about wait times/bandwidth/burst credits for these volumes that correlate with your latency spikes? Can you try using local SSDs (e.g. upgrade to r5d.xlarge) to rule out EBS as a source of these issues?