I am asking for help with a strange problem: often, when we bulk-index data, thousands of tiny segments holding only one doc each are created. This brings the cluster to its knees by consuming all the CPU on the affected nodes, often disrupting the service.
Here is the situation:
- We have 3 servers (16 cores/1.2TB SSD/128GB RAM each) with 3 instances of ES 1.7.1 each (24G RAM per instance)
- Our index holds 1.6 billion records in 10 shards with 1 replica, totaling 1.2TB on disk
- This makes about 700 segments in normal conditions
- It is queried at about 150-200 queries per second
- Every hour, a few million records are added or updated, in bulk mode, using 8 parallel connections. This takes between 5 and 20 min.
The problem arises every hour during the bulk indexing. During the first few minutes, a huge number of segments are created (I have seen up to 6000) that hold only one document each. After that, the ES instance holding those segments uses so much CPU that it is almost unresponsive, which slows down the other instances on the same server and sometimes even disrupts the service.
After a few minutes of very heavy CPU usage, the segments are finally merged (the count goes back down to the usual ~700) and everything returns to normal.
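To make the symptom measurable, here is a minimal sketch that counts single-document segments in the JSON returned by the segments API (`GET /<index>/_segments`). The function name, the index name `myindex` and the sample response are hypothetical, and the response layout (index, then shard id, then a list of shard copies, each with a `segments` map carrying `num_docs`) is my understanding of the 1.x format, so please treat it as an assumption.

```python
def count_single_doc_segments(segments_response):
    """Return (total_segments, single_doc_segments) for a _segments response.

    Assumed layout: indices -> <index> -> shards -> <shard id> ->
    [shard copies] -> segments -> <segment name> -> num_docs
    """
    total = 0
    single = 0
    for index_data in segments_response.get("indices", {}).values():
        for shard_copies in index_data.get("shards", {}).values():
            for shard_copy in shard_copies:
                for segment in shard_copy.get("segments", {}).values():
                    total += 1
                    if segment.get("num_docs") == 1:
                        single += 1
    return total, single


# Hypothetical, heavily trimmed response used only to exercise the function.
sample_response = {
    "indices": {
        "myindex": {
            "shards": {
                "0": [
                    {
                        "segments": {
                            "_a": {"num_docs": 1},
                            "_b": {"num_docs": 1},
                            "_c": {"num_docs": 250000},
                        }
                    }
                ]
            }
        }
    }
}

total, single = count_single_doc_segments(sample_response)
print("segments: %d, single-doc segments: %d" % (total, single))
```

Polling this every few seconds during the hourly bulk load should show when the one-doc segments appear and how long the merges take to absorb them.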
Is this a bug? It seems to me that a segment should rarely hold only one doc...
Do you have advice on which settings to tune to avoid this problem? We have already tried different refresh intervals (-1, 1s, 10s, 30s) and different merge throttling throughputs (from unlimited down to 10MBps), but the problem still occurs almost every time.
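For reference, the combinations above can be written down as request bodies. This is only a sketch: it assumes the settings in question are `index.refresh_interval` (index settings API) and `indices.store.throttle.type` / `indices.store.throttle.max_bytes_per_sec` (cluster settings API) as documented for ES 1.x, and the helper name `build_tuning_requests` is hypothetical.

```python
import json


def build_tuning_requests(refresh_interval, merge_throttle):
    """Build the JSON bodies for one refresh/throttle combination.

    refresh_interval: e.g. "-1", "1s", "10s", "30s"
    merge_throttle:   e.g. "10mb", or None for unlimited
    Setting names are assumed from the ES 1.x documentation.
    """
    # Body for: PUT /<index>/_settings
    index_body = {"index": {"refresh_interval": refresh_interval}}

    # Body for: PUT /_cluster/settings
    if merge_throttle is None:
        throttle = {"indices.store.throttle.type": "none"}
    else:
        throttle = {
            "indices.store.throttle.type": "merge",
            "indices.store.throttle.max_bytes_per_sec": merge_throttle,
        }
    cluster_body = {"transient": throttle}
    return index_body, cluster_body


index_body, cluster_body = build_tuning_requests("30s", "10mb")
print(json.dumps(index_body))
print(json.dumps(cluster_body))
```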
Thanks for your wisdom!