Indexing performance degrading over time

Hmm, the hot threads output from that node (the one that stays busy even after you stop bulk indexing) looks unhealthy: nearly all threads are doing this:

  97.7% (976.8ms out of 1s) cpu usage by thread 'elasticsearch[test_data_11-d2][bulk][T#7]'
     10/10 snapshots sharing following 22 elements
       java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(Unknown Source)
       java.lang.ThreadLocal$ThreadLocalMap.remove(Unknown Source)
       java.lang.ThreadLocal$ThreadLocalMap.access$200(Unknown Source)
       java.lang.ThreadLocal.remove(Unknown Source)
       java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(Unknown Source)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(Unknown Source)
       java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(Unknown Source)
       org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
       org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:365)
       org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:531)
       org.elasticsearch.index.engine.Engine$Create.execute(Engine.java:810)
       org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:476)
       org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:69)

It's as if you have way too many ThreadLocal instances, and these threads are stuck walking through all of them, pruning the ones that are no longer referenced.
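To make that code path concrete, here is a rough Java sketch. It assumes OpenJDK's ReentrantReadWriteLock, which (as the stack trace suggests) tracks each reader's hold count in a ThreadLocal and calls ThreadLocal.remove() on unlock. The class name StaleThreadLocalDemo and its helpers are made up for illustration. The sketch creates many ThreadLocals on one thread, drops them, and then times read lock/unlock cycles on that thread. It exercises the same calls as the trace but is not guaranteed to reproduce the slowdown, since System.gc() is only a hint and the map layout varies.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StaleThreadLocalDemo {

    // Create `count` ThreadLocals, set a value on the current thread, and drop
    // every reference so their map entries can become stale after a GC.
    static void leakThreadLocals(int count) {
        for (int i = 0; i < count; i++) {
            ThreadLocal<Object> tl = new ThreadLocal<>();
            tl.set(new Object());
        }
    }

    // Time `iterations` read-lock acquire/release pairs on the current thread.
    static long timeReadLockCycles(ReentrantReadWriteLock lock, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            lock.readLock().lock();
            lock.readLock().unlock(); // goes through ThreadLocal.remove() for non-first readers
        }
        return System.nanoTime() - start;
    }

    // Returns { elapsed nanoseconds on the worker, read lock count afterwards }.
    static long[] run(int staleCount, int iterations) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        // Hold a read lock on the calling thread so the worker is not the lock's
        // "first reader" and gets its hold count stored in a ThreadLocal.
        lock.readLock().lock();
        long[] elapsed = new long[1];
        Thread worker = new Thread(() -> {
            leakThreadLocals(staleCount);
            System.gc(); // only a hint; stale entries may or may not be cleared yet
            elapsed[0] = timeReadLockCycles(lock, iterations);
        });
        worker.start();
        worker.join();
        lock.readLock().unlock();
        return new long[] { elapsed[0], lock.getReadLockCount() };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] few = run(10, 100_000);
        long[] many = run(1_000_000, 100_000);
        System.out.println("few stale ThreadLocals:  " + few[0] / 1_000_000 + " ms");
        System.out.println("many stale ThreadLocals: " + many[0] / 1_000_000 + " ms");
    }
}
```

If the second number comes out much larger than the first on your JVM, that matches the picture above; if not, the stale entries were probably cleaned up before the timed loop started.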

Which Java version are you using?

Do you still have any settings that increase thread pool sizes (or similar)?