Hmm, the hot threads output from that "still busy even after stopping bulk indexing" node looks unhealthy, with nearly all threads doing this:
97.7% (976.8ms out of 1s) cpu usage by thread 'elasticsearch[test_data_11-d2][bulk][T#7]'
10/10 snapshots sharing following 22 elements
java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(Unknown Source)
java.lang.ThreadLocal$ThreadLocalMap.remove(Unknown Source)
java.lang.ThreadLocal$ThreadLocalMap.access$200(Unknown Source)
java.lang.ThreadLocal.remove(Unknown Source)
java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(Unknown Source)
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(Unknown Source)
org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:365)
org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:531)
org.elasticsearch.index.engine.Engine$Create.execute(Engine.java:810)
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:476)
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:69)
It's as if you have way too many ThreadLocal instances, and these threads are stuck walking through all of them, pruning the now-unreferenced ones.
Which Java version are you using?
Do you still have any settings that increase, e.g., thread pool sizes?
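
If you want to poke at the ThreadLocal theory outside of Elasticsearch, here is a rough standalone sketch of the same code path (ReadLock.unlock() calling ThreadLocal.remove()). All class and method names are made up for illustration, and it assumes OpenJDK's ReentrantReadWriteLock, where a reader that is not the first reader has its hold count kept in a ThreadLocal. It is not guaranteed to reproduce the slowdown: stale entries only exist once GC has cleared the unreferenced ThreadLocal keys, and System.gc() is only a hint.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockThreadLocalSketch {

    // Hypothetical helper: put `count` entries into the calling thread's
    // ThreadLocalMap. The ThreadLocal keys are unreachable once this returns,
    // so the entries can go stale after a GC.
    static void leakThreadLocals(int count) {
        for (int i = 0; i < count; i++) {
            ThreadLocal<byte[]> tl = new ThreadLocal<>();
            tl.set(new byte[16]);
        }
    }

    // Lock and unlock the read lock `cycles` times on a worker thread while the
    // calling thread also holds a read lock. Returns the elapsed nanoseconds.
    static long timeReadLockCycles(int unreferencedThreadLocals, int cycles) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        // The caller becomes the "first reader", so (in OpenJDK) the worker's
        // hold count goes through a ThreadLocal that is removed on each unlock.
        lock.readLock().lock();
        long[] elapsed = new long[1];
        try {
            Thread worker = new Thread(() -> {
                leakThreadLocals(unreferencedThreadLocals);
                System.gc(); // only a hint; stale entries are not guaranteed
                long start = System.nanoTime();
                for (int i = 0; i < cycles; i++) {
                    lock.readLock().lock();
                    lock.readLock().unlock();
                }
                elapsed[0] = System.nanoTime() - start;
            });
            worker.start();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        } finally {
            lock.readLock().unlock();
        }
        return elapsed[0];
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("few ThreadLocals:  " + timeReadLockCycles(10, 100_000) + " ns");
        System.out.println("many ThreadLocals: " + timeReadLockCycles(200_000, 100_000) + " ns");
    }
}
```

Comparing the two timings on your JVM gives a rough feel for whether a bloated per-thread ThreadLocal map makes read lock release more expensive there. The numbers will vary a lot between runs and Java versions, so treat them as a hint, not a benchmark.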