Forcemerge into one segment - bad for Sorting performance on doc_value fields?

rkrombho · March 30, 2016, 1:03pm

Hi All,

I have a number of fairly large time-based indices, which we create every week.
This piece of documentation in the Guide
https://www.elastic.co/guide/en/elasticsearch/guide/current/merge-process.html#optimize-api
gave me the impression that having fewer segments would give me better performance in any case.

So we always force a segment merge down to one single segment after we are done with the weekly indexing.
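
For reference, a merge like the one described above corresponds to a call to the `_forcemerge` endpoint with `max_num_segments=1`. The following is a minimal sketch that only builds the request and does not send it; the host `http://localhost:9200` and the index name `logs-2016.13` are hypothetical placeholders, not values from this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_forcemerge_request(base_url: str, index: str, max_num_segments: int = 1) -> Request:
    """Build (but do not send) a POST request to the _forcemerge endpoint."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"
    return Request(url, method="POST")


# Hypothetical host and index name, for illustration only.
request = build_forcemerge_request("http://localhost:9200", "logs-2016.13")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually trigger the merge, which is an expensive operation on indices of this size.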

A weekly index holds roughly 2,000,000,000 documents, ~500GB in total, spread across two shards (with 1 replica) on 4 nodes.

What I have noticed with our large indices is that something seems to hurt sort performance a lot (the sorts always run on doc_value fields).
I have also noticed that sorted queries do not benefit much from caching.
Subsequent executions of simple sorted filter queries still take ~1.5 - 2 seconds.
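
To make "simple sorted filter query" concrete, here is a rough sketch of the kind of search body meant. The field names `status` and `timestamp` and the filter value are hypothetical stand-ins, not the actual mapping; the sort field is assumed to be backed by doc values.

```python
import json

# Hypothetical field names: "status" for the filter, "timestamp" for the
# doc_values-backed sort field.
search_body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "ok"}}
            ]
        }
    },
    "sort": [
        {"timestamp": {"order": "desc"}}
    ],
    "size": 10,
}

print(json.dumps(search_body, indent=2))
```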

Here are the details of the two segments of the primary shards:

"_38h" : {
  "generation" : 4193,
  "num_docs" : 995808448,
  "deleted_docs" : 0,
  "size_in_bytes" : 253814717556,
  "memory_in_bytes" : 752220123,
  "committed" : true,
  "search" : true,
  "version" : "5.4.1",
  "compound" : false
}
...
"_363" : {
  "generation" : 4107,
  "num_docs" : 995768798,
  "deleted_docs" : 0,
  "size_in_bytes" : 253690281166,
  "memory_in_bytes" : 749432404,
  "committed" : true,
  "search" : true,
  "version" : "5.4.1",
  "compound" : false
}

That's a segment size of ~236GB each. Is there a point where overly large segments have a negative impact on sorts (or IO buffering in general)?
My naive theory is that these segments are way too large for the OS to do efficient IO buffering, and because we use doc values we are hitting the disks way too often (including on subsequent re-executions of the same query).
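
As a quick check of the size figure, the following sketch converts `size_in_bytes` from the segment stats above into GiB and derives an average on-disk size per document. The numbers are copied from the stats; the helper name `summarize` is only for illustration. Both segments come out at about 236 GiB and roughly 255 bytes per document.

```python
GIB = 1024 ** 3

# num_docs and size_in_bytes copied from the segment stats above.
segments = {
    "_38h": {"num_docs": 995808448, "size_in_bytes": 253814717556},
    "_363": {"num_docs": 995768798, "size_in_bytes": 253690281166},
}


def summarize(stats):
    """Return (segment size in GiB, average bytes per document)."""
    size_gib = stats["size_in_bytes"] / GIB
    bytes_per_doc = stats["size_in_bytes"] / stats["num_docs"]
    return size_gib, bytes_per_doc


for name, stats in segments.items():
    size_gib, bytes_per_doc = summarize(stats)
    print(f"{name}: {size_gib:.1f} GiB, {bytes_per_doc:.0f} bytes per document")
```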

Does anyone have experience with, or know about, negative impacts of overly large segments?

Cheers
Robert

jpountz · March 31, 2016, 8:45am

If your query is just a match_all sorted by a date field, then search performance should be about the same regardless of the number of segments. I don't think OS buffering would perform worse on large files. Merging down to fewer segments mostly helps with queries that are terms-dictionary intensive, like range or prefix queries, and with things that need global ordinals, like parent/child queries and terms aggregations.
