I am trying to understand behavior I am seeing from Elasticsearch 6.8 on a production site. During a period of heavy use (indexing large numbers of documents), Prometheus metrics are showing me that ES is using large amounts of CPU and a steadily increasing amount of disk space, but memory usage is only slightly elevated over the norm. In the ES logs, I see many messages like this:
[2023-10-20T15:15:01,932][INFO ][o.e.i.IndexingMemoryController] [elasticsearch-master-0] now throttling indexing for shard [[data_explorer][1]]: segment writing can't keep up
This continues until disk usage hits 95%, at which point the index is put into read-only / allow delete mode (the `index.blocks.read_only_allow_delete` block) as described here, and the remaining index requests fail. Disk usage then falls to slightly below its level before the indexing started. The process that is invoking ES reports no error messages until the index is set to read-only mode. After this, the index has to be taken out of read-only mode manually.
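For reference, the manual reset amounts to clearing the `index.blocks.read_only_allow_delete` setting on the index. Below is a minimal sketch of that call, assuming the cluster is reachable at `http://localhost:9200` without authentication (both assumptions, adjust as needed). The helper names `build_unblock_request` and `clear_read_only_block` are made up for illustration; the index name comes from the log line above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unauthenticated cluster
INDEX = "data_explorer"           # index name taken from the log line above


def build_unblock_request(es_url: str, index: str) -> urllib.request.Request:
    """Build a PUT /<index>/_settings request that clears the read-only block."""
    # Setting the value to null removes the block from the index settings.
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def clear_read_only_block(es_url: str = ES_URL, index: str = INDEX) -> dict:
    """Send the request and return the parsed JSON response from ES."""
    req = build_unblock_request(es_url, index)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Usage (requires a reachable cluster):
#   print(clear_read_only_block())
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a production cluster.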
I have three main questions:
- Why does disk usage drop back to its original level after the index flips to read-only mode?
- Why does the read-only setting on the index have to be reset manually if the doc linked above says "the index block is automatically released when the disk utilization falls below the high watermark"?
- What does the throttling seen in the ES logs mean, and what is causing it? Do the ES docs describe this process anywhere? (I have not been able to find an official explanation.)
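In case it helps anyone reproduce this, the throttling can also be observed outside the logs through the index stats API, which reports `is_throttled` and `throttle_time_in_millis` under `indexing` and `total_throttled_time_in_millis` under `merges`. The following is only a rough sketch: the URL and the helper names `fetch_index_stats` and `summarize_throttling` are assumptions for illustration, and it says nothing about why the throttling happens.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unauthenticated cluster
INDEX = "data_explorer"           # index name taken from the log line above


def fetch_index_stats(es_url: str = ES_URL, index: str = INDEX) -> dict:
    """GET /<index>/_stats/indexing,merge and return the parsed JSON."""
    url = f"{es_url}/{index}/_stats/indexing,merge"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_throttling(stats: dict, index: str) -> dict:
    """Pull the throttling-related counters out of an index stats response."""
    total = stats["indices"][index]["total"]
    indexing = total.get("indexing", {})
    merges = total.get("merges", {})
    return {
        # True while ES is currently throttling indexing on this index
        "is_throttled": indexing.get("is_throttled", False),
        # Cumulative time indexing has spent throttled
        "indexing_throttle_ms": indexing.get("throttle_time_in_millis", 0),
        # Cumulative time merges have spent throttled
        "merge_throttled_ms": merges.get("total_throttled_time_in_millis", 0),
    }

# Usage (requires a reachable cluster):
#   print(summarize_throttling(fetch_index_stats(), INDEX))
```

Polling this during a heavy indexing run and comparing it with the disk usage graph should show whether the throttling periods line up with the growth in disk usage.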