I'm noticing random "slowness" when writing. For example, most write operations complete in under 20ms, but occasionally a write takes more than 1s.
My setup:
- 6 data nodes, 24G JVM heap (there are 3 additional master nodes).
- 4 indices: 2 with heavy writes only, 1 with heavy reads and writes, and 1 with heavy writes and moderate reads.
- Index shards are spread evenly across all nodes. 2 indices have 12 primaries and 2 have 6 primaries. All have 1 replica.
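
To put numbers on the shard layout above, here is a rough tally of shard copies per data node. It assumes the shards are perfectly balanced, which is only approximately true in practice; the variable names are just for illustration.

```python
# Rough shard count per data node for the setup described above.
# Assumption: shard copies are spread perfectly evenly across the data nodes.
data_nodes = 6
primaries_per_index = [12, 12, 6, 6]  # 2 indices with 12 primaries, 2 with 6
replicas = 1                          # every index has 1 replica

total_primaries = sum(primaries_per_index)
total_copies = total_primaries * (1 + replicas)  # primaries plus replicas
copies_per_node = total_copies / data_nodes

print(f"primaries: {total_primaries}")
print(f"shard copies (primaries + replicas): {total_copies}")
print(f"shard copies per data node: {copies_per_node:g}")
```

So each data node holds on the order of a dozen shard copies, any of which may be refreshing at a given moment.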
This "slowness" happens at random on all write-heavy indices. Each index receives roughly 5 write operations per second.
After enabling tracing on 'logger.org.elasticsearch.index', I found that while a node is refreshing a shard it "suspends" all writes, even writes to a different index. For example, if the node is currently refreshing shards of index_A and a write request comes in for index_B, the write to index_B doesn't complete/return until that node finishes refreshing index_A.
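
For anyone wanting to reproduce the tracing, one way to change a logger level dynamically is the cluster settings API (`PUT /_cluster/settings`). Below is a minimal sketch that only builds the request body and does not send anything to a cluster. The helper name `build_logger_settings` is hypothetical, and using a transient rather than a persistent setting is an assumption.

```python
import json


def build_logger_settings(logger_name: str, level: str) -> dict:
    """Build a request body for PUT /_cluster/settings that sets one logger level.

    Assumption: a transient setting is used, so it does not survive a full
    cluster restart.
    """
    return {"transient": {logger_name: level}}


# Logger name taken from the post; the request itself is not sent here.
body = build_logger_settings("logger.org.elasticsearch.index", "TRACE")
print(json.dumps(body, indent=2))
```

The same body with the level set to `None` (JSON `null`) should reset the logger once the investigation is done.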
This behaviour is compounded by replica shards, since a write request cannot complete until it has been written to the replicas as well. Thus, if any of the nodes processing the write (1 primary and n replicas) is currently refreshing, the write request experiences the "slowness".
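
To illustrate how replicas compound this, here is a back-of-the-envelope model. It assumes each node spends a fraction p of its time refreshing, independently of the other nodes, and that the primary and its replicas sit on different nodes. Both the independence and the example value of p are assumptions, not measurements. Under those assumptions, a write touching 1 primary and n replicas is delayed with probability 1 - (1 - p)^(n + 1).

```python
def p_write_delayed(p_refreshing: float, replicas: int) -> float:
    """Probability that at least one shard copy involved in a write is on a
    node that is currently refreshing.

    Assumptions: each node refreshes independently of the others, and
    p_refreshing is the fraction of time a node spends refreshing.
    """
    copies = 1 + replicas  # 1 primary plus n replicas
    return 1.0 - (1.0 - p_refreshing) ** copies


# Hypothetical value: a node spends 5% of its time refreshing.
for n in (0, 1, 2):
    print(f"replicas={n}: P(delayed) = {p_write_delayed(0.05, n):.4f}")
```

With these made-up numbers, going from 0 to 1 replica nearly doubles the chance that a given write is held up.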
Is this the correct behaviour?
Any suggestions on how to reduce the duration or the frequency of this "slowness" would be greatly appreciated.