ES creating thousands of segments with 1 document each

Hello Community,

I am asking for help because we are struggling with a strange problem: often, during bulk indexing, thousands of tiny segments containing only one document each are created. This brings the cluster to its knees by consuming all the CPU on the impacted nodes, often disrupting the service.

Here is the situation:

  • We have 3 servers (16 cores/1.2TB SSD/128GB RAM each) with 3 instances of ES 1.7.1 each (24G RAM per instance)
  • Our index holds 1.6 billion records in 10 shards with 1 replica, totaling 1.2TB on disk
  • This makes about 700 segments in normal conditions
  • It is queried at about 150-200 queries per second
  • Every hour, a few million records are added or updated in bulk mode, using 8 parallel connections. This takes between 5 and 20 minutes.

The problem arises every hour during the bulk indexing. During the first few minutes, a huge number of segments are created (I have seen up to 6000), each holding only one document. After that, the ES instance holding those segments uses so much CPU that it is almost unresponsive, which slows down the other instances on the same server and sometimes even disrupts the service.

After a few minutes of very heavy CPU usage, the segments are finally merged (the count goes down to a normal ~ 700) and everything goes back to normal.
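
For anyone who wants to watch the same symptom on their own cluster, here is a minimal sketch that counts one-document segments. It assumes the `_cat/segments` endpoint is available on your version and that the `h=` parameter is used to request exactly four columns; `base_url` and `index` are placeholders, not values from our setup.

```python
import urllib.request


def count_single_doc_segments(cat_output):
    """Return (total_segments, segments_with_exactly_one_doc).

    Expects the plain-text body of
    GET /_cat/segments/<index>?h=shard,prirep,segment,docs.count
    which is one whitespace-separated row per segment, without a header.
    """
    total = 0
    tiny = 0
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected rows
        total += 1
        if fields[3] == "1":  # docs.count column
            tiny += 1
    return total, tiny


def fetch_segments(base_url, index):
    """Fetch the segment listing; base_url and index are placeholders."""
    url = "%s/_cat/segments/%s?h=shard,prirep,segment,docs.count" % (base_url, index)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")
```

Polling this every few seconds during the hourly bulk load should show the one-document count climbing and then dropping back once the merges complete.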

Is this a bug? It seems to me that a segment should rarely hold only one doc...
Do you have any advice on which settings to tune to avoid this problem? We have already tried different refresh intervals (-1, 1s, 10s, 30s) and different merge throttling throughputs (from unlimited down to 10MBps), but the problem still occurs almost every time.

Thanks for your wisdom!

Hervé BRY

That sounds odd. Can you check the index settings? Have you by any chance set index.translog.flush_threshold_ops or index.translog.flush_threshold_size to an inappropriate value?
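
A minimal sketch of that check, assuming the body of `GET /<index>/_settings` has already been parsed from JSON into a dict of the form `{index_name: {"settings": {...}}}`. The helper names are made up for illustration, and the function only reports translog settings that are explicitly set on the index.

```python
def flatten(settings, prefix=""):
    """Turn nested settings into dotted keys, e.g. index.translog.flush_threshold_ops."""
    flat = {}
    for key, value in settings.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def translog_overrides(settings_response):
    """Return {index_name: {setting: value}} for explicit index.translog.* settings."""
    found = {}
    for index_name, body in settings_response.items():
        flat = flatten(body.get("settings", {}))
        hits = {k: v for k, v in flat.items() if k.startswith("index.translog.")}
        if hits:
            found[index_name] = hits
    return found
```

An empty result means no translog setting is overridden at the index level.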

In addition to @Christian_Dahlqvist's ideas, have a look at this:

It might be what you are seeing. Maybe.

You can test this by pushing a single document into the index before you do the bulk index and then waiting 45 seconds or so and then doing the bulk load.
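
Roughly, the test could look like the sketch below. It is only an illustration: `send` is a hypothetical callable that POSTs a body to the given path on the cluster, and the index, type, and document values are placeholders. The 45 second wait is the figure suggested above.

```python
import json
import time


def build_bulk_body(index, doc_type, docs):
    """Serialize (doc_id, source) pairs into the newline-delimited bulk format."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def warm_up_then_bulk(send, index, doc_type, warmup_doc, docs,
                      wait_seconds=45, sleep=time.sleep):
    """Index one document, wait, then send the real bulk load.

    send(path, body) is a hypothetical helper that performs the HTTP POST.
    """
    send("/_bulk", build_bulk_body(index, doc_type, [warmup_doc]))
    sleep(wait_seconds)
    send("/_bulk", build_bulk_body(index, doc_type, docs))
```

If the segment explosion disappears when the load is preceded by the warm-up document, that points toward the issue linked above rather than a settings problem.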

Thanks for your suggestions.

@Christian_Dahlqvist: there are no specific translog options set for the index. Here is the config we use:

@nik9000: We are going to try your suggestion. It does indeed seem like it could be the cause of our problem. Is there any chance this PR might be merged into ES 1.x?

I doubt it. It's pretty deep in the 2.0 line, so it'd be quite difficult.