Latency spike after big merge

I'm trying to figure out why, in the use case below, there is a latency spike after a big merge, one that creates an 8M-document segment. What I'm observing is that immediately after the merge, queries whose 99th-percentile latency is normally 4ms suddenly take 1000ms, and then settle back to their usual latency.

Details:

  1. ES version 5.6.3.
  2. Index data files reside in tmpfs.
  3. Swapping disabled for the ES instance (verified using _nodes?filter_path=**.mlockall)
  4. No long GC pauses at the time of the spike.
  5. No use of field data - all relevant fields have doc values, verified using
    _cat/fielddata?v
  6. No terms aggregations, so global ordinals are not involved.
  7. 9M docs, 6GB index on a 128GB RAM machine.
  8. Single shard.
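
For reference, the verifications in items 3 and 5 above correspond to the following requests (console-style; the first returns the per-node mlockall flag, the second the per-field fielddata usage):

GET _nodes?filter_path=**.mlockall
GET _cat/fielddata?v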

The query below uses only range queries on numeric fields and term/terms queries.
It's my understanding that since ES 5.4, thanks to the improved query planning for range queries, this query should be fast regardless of whether the range clauses are already cached, because it is a conjunction with a terms query on _id that matches only 7 documents.
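
One way to see how the query gets planned and where the time goes on a cold segment is the search profile API, available in 5.x. A sketch (the index name my_index is a placeholder, and the query is abbreviated here to just the _id clause):

GET my_index/_search
{
  "profile": true,
  "query": {
    "terms": {
      "_id": [1699340119, 1496791369, 1731738765]
    }
  }
}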

So my question is: assuming the issue is indeed with Elasticsearch / Lucene and not some external effect, what can cause this query to be slow on a cold segment?

These are the mappings:

          "Exclusions_Int": {
            "type": "integer",
            "store": true
          },
          "Language_String_Unanalysed": {
            "type": "keyword",
            "store": true
          },
          "LifespanExpirationTimestamp_DateTime": {
            "type": "date",
            "store": true,
            "format": "dateOptionalTime"
          },
          "SourceId_Long": {
            "type": "long",
            "store": true
          },
          "TitleLength_Int_Ranged": {
            "type": "integer",
            "store": true
          },
          "Valid_Boolean_IndexedOnly": {
            "type": "boolean"
          },
          "WhiteLists_Int": {
            "type": "integer",
            "store": true
          }

Some of the mappings (such as WhiteLists_Int) would be better mapped as keyword, since they are only ever used for exact term lookups, never for ranges.
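
As a sketch, such a field could instead be mapped like this (note that keyword indexes the values as strings):

"WhiteLists_Int": {
  "type": "keyword",
  "store": true
}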

The query:

"query": {
  "bool": {
    "filter": [
      {
        "bool": {
          "must": [
            {
              "terms": {
                "_id": [
                  1699340119,
                  1496791369,
                  1731738765,
                  1903196112,
                  1907088712,
                  1919973438,
                  1907074472
                ],
                "boost": 1.0
              }
            },
            {
              "bool": {
                "must_not": [
                  {
                    "term": {
                      "SourceId_Long": {
                        "value": 2637283,
                        "boost": 1.0
                      }
                    }
                  }
                ],
                "should": [
                  {
                    "term": {
                      "WhiteLists_Int": {
                        "value": 11619,
                        "boost": 1.0
                      }
                    }
                  }
                ],
                "disable_coord": false,
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "LifespanExpirationTimestamp_DateTime": {
                        "from": "now/d",
                        "to": null,
                        "include_lower": true,
                        "include_upper": true,
                        "boost": 1.0
                      }
                    }
                  },
                  {
                    "bool": {
                      "must_not": [
                        {
                          "exists": {
                            "field": "LifespanExpirationTimestamp_DateTime",
                            "boost": 1.0
                          }
                        }
                      ],
                      "disable_coord": false,
                      "adjust_pure_negative": true,
                      "boost": 1.0
                    }
                  }
                ],
                "disable_coord": false,
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            {
              "terms": {
                "Valid_Boolean_IndexedOnly": [
                  true
                ],
                "boost": 1.0
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "term": {
                      "SourceId_Long": {
                        "value": 2637283,
                        "boost": 1.0
                      }
                    }
                  },
                  {
                    "term": {
                      "Language_String_Unanalysed": {
                        "value": "de",
                        "boost": 1.0
                      }
                    }
                  }
                ],
                "disable_coord": false,
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            {
              "range": {
                "TitleLength_Int_Ranged": {
                  "from": null,
                  "to": 150,
                  "include_lower": true,
                  "include_upper": true,
                  "boost": 1.0
                }
              }
            }
          ],
          "must_not": [
            {
              "terms": {
                "Exclusions_Int": [
                  1327
                ],
                "boost": 1.0
              }
            }
          ],
          "disable_coord": false,
          "adjust_pure_negative": true,
          "boost": 1.0
        }
      }
    ],
    "disable_coord": false,
    "adjust_pure_negative": true,
    "boost": 1.0
  }
}

I believe I found the cause of the latency spike: see LUCENE-8213.
Indeed, when I disabled query caching using

"index.queries.cache.enabled": "false"

the issue disappeared.
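
For reference, index.queries.cache.enabled is a static index setting, so disabling it means setting it at index creation time (or on a closed index). A sketch, with my_index as a placeholder:

PUT my_index
{
  "settings": {
    "index.queries.cache.enabled": false
  }
}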
