Very slow search performance after upgrade to 7.16.1

I have noticed that search performance for my cluster has reduced drastically since upgrading to 7.16.1 (from 7.15).

As an example consider an index dns-2022.01.31. There are around 56M docs in this and bulk of those docs are for destination_ip:10.11.100.100.

A query like the following

{
	"size": 1000,
	"query": {
		"bool": {
			"filter": [{
					"range": {
						"@timestamp": {
							"from": "2022-01-31T04:54:51.852000Z",
							"to": "2022-01-31T13:06:18.873000Z",
							"include_lower": true,
							"include_upper": true,
							"boost": 1.0
						}
					}
				},
				{
					"terms": {
						"source_ip": ["10.15.54.82"],
					    "boost": 1.0
					}
				},
                {
					"terms": {
						"destination_ip": ["10.11.100.100"],
						"boost": 1.0
					}
				},
				{
					"terms": {
						"query": ["update.googleapis.com"],
						"boost": 1.0
					}
				},
				{
					"terms": {
						"sensor": ["ids10"],
						"boost": 1.0
					}
				}
			],
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	}
}

takes around 300ms to return 20 results.

When I run the profiler, it shows PointInSet query against destination_ip is the most expensive query that is run & takes bulk of that time (99.5%) which is understandable as most of entries are for this destination_ip.

Here's what profiler shows

PointInSetQuery - 324.7ms - destination_ip:{10.11.100.100}
ConstantScoreQuery - 0.1ms - ConstantScore(query:update.googleapis.com)
ConstantScoreQuery - 0.0ms - ConstantScore(sensor:ids10)
IndexOrDocValuesQuery - 0.2ms - @timestamp:[1643604891852 TO 1643634378873]
PointInSetQuery - 0.1ms - source_ip:{10.15.54.82}

The bulk of time (100%) is spent in build_scorer

If I remove the destination_ip filter entirely, the query runs in <2ms because the top costly query is no longer run. Here's profiler output in that case

ConstantScoreQuery - 0.3ms - ConstantScore(query:update.googleapis.com)
ConstantScoreQuery - 0.2ms - ConstantScore(sensor:ids10)
IndexOrDocValuesQuery - 0.2ms - @timestamp:[1643604891852 TO 1643634378873]
PointInSetQuery - 0.1ms - source_ip:{10.15.54.82}

This seems counter-intuitive to me and I don't think this issue was there in previous version.

As I cannot remove destination_ip from the query, how can I ensure my query1 runs as fast as query2. Has the query optimizer or processor changed recently which may have brought up this "issue"?

Thanks,
Dheeraj

As I can't downgrade to 7.15.1, I am unable to post comparative query profile from that version. But looking at program performance, I have a suspicion that there has been a 10-30x slowdown in query processing. Please see this post for more background information which may help someone to reproduce the problem.

I have also observed kibana to be taking "slightly" longer in rendering saved searches (for discover) since the upgrade. Initially I dismissed this as misconception, but looking the query processing program performance, I am starting to feel the slowdown is real.

I had look into the Lucene changes between version 8.9.0 (Elasticsearch 7.15) and 8.10.1 (Elasticsearch 7.16) and there is nothing suspicious that might have cause this.

I am assuming the mapping is the same and data between the two version equivalent. One thing you can try is to run the slower query in a loop and try to capture the hot threads to see if we can see where we are spending most of the time.

Note: It is a bad idea to map a field as an IP if you are going to perform terms queries. It would be much more efficient to map the field as keyword (if you are no going to rum range queries on it).

Thanks for looking at this Ignacio

Yes the mapping is the same and data between two versions is broadly equivalent. It is a daily summary of DNS request/responses in organization and is mostly similar (atleast in count and destination and gross distribution of queries).

Thanks for tip about IP. These terms are also used for range queries (when using Kibana) but I guess I can copy the value to a separate keyword field and use that for searching. Although this mapping has been use since before ES-7.15 so it would only be an additional optimization.

Yes, that is what I would like to understand. The best thing is if you can capture the hot threads as I explained above. The profile of the query might be useful as well to understand where we are spending most of the time.

I re-ran the query. Correction - It returns 200+ hits (still a tiny percent)

The query profile is

"profile" : {
    "shards" : [
      {
        "id" : "[7VyrRwY7RsaNsG9Ku_4mHA][dns-2022.01.31][1]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "#@timestamp:[1643604891852 TO 1643634378873] #source_ip:{10.15.54.82} #destination_ip:{10.11.100.100} #ConstantScore(query:update.googleapis.com) #ConstantScore(sensor:ids10)",
                "time_in_nanos" : 275082934,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 101,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 2170938,
                  "match" : 493532,
                  "next_doc_count" : 101,
                  "score_count" : 101,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 102298,
                  "advance_count" : 1,
                  "score" : 43834,
                  "build_scorer_count" : 2,
                  "create_weight" : 785962,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 271486370
                },
                "children" : [
                  {
                    "type" : "IndexOrDocValuesQuery",
                    "description" : "@timestamp:[1643604891852 TO 1643634378873]",
                    "time_in_nanos" : 539188,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 101,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 359860,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 61779,
                      "advance_count" : 102,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 5184,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 112365
                    }
                  },
                  {
                    "type" : "PointInSetQuery",
                    "description" : "source_ip:{10.15.54.82}",
                    "time_in_nanos" : 256816,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 45866,
                      "match" : 0,
                      "next_doc_count" : 101,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 90563,
                      "advance_count" : 189,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 1979,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 118408
                    }
                  },
                  {
                    "type" : "PointInSetQuery",
                    "description" : "destination_ip:{10.11.100.100}",
                    "time_in_nanos" : 267025297,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 64188,
                      "advance_count" : 102,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 1103,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 266960006
                    }
                  },
                  {
                    "type" : "ConstantScoreQuery",
                    "description" : "ConstantScore(query:update.googleapis.com)",
                    "time_in_nanos" : 860151,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 683555,
                      "advance_count" : 290,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 98812,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 77784
                    },
                    "children" : [
                      {
                        "type" : "TermQuery",
                        "description" : "query:update.googleapis.com",
                        "time_in_nanos" : 472118,
                        "breakdown" : {
                          "set_min_competitive_score_count" : 0,
                          "match_count" : 0,
                          "shallow_advance_count" : 0,
                          "set_min_competitive_score" : 0,
                          "next_doc" : 0,
                          "match" : 0,
                          "next_doc_count" : 0,
                          "score_count" : 0,
                          "compute_max_score_count" : 0,
                          "compute_max_score" : 0,
                          "advance" : 390704,
                          "advance_count" : 290,
                          "score" : 0,
                          "build_scorer_count" : 3,
                          "create_weight" : 8992,
                          "shallow_advance" : 0,
                          "create_weight_count" : 1,
                          "build_scorer" : 72422
                        }
                      }
                    ]
                  },
                  {
                    "type" : "ConstantScoreQuery",
                    "description" : "ConstantScore(sensor:ids10)",
                    "time_in_nanos" : 661630,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 636967,
                      "advance_count" : 102,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 9263,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 15400
                    },
                    "children" : [
                      {
                        "type" : "TermQuery",
                        "description" : "sensor:ids10",
                        "time_in_nanos" : 540330,
                        "breakdown" : {
                          "set_min_competitive_score_count" : 0,
                          "match_count" : 0,
                          "shallow_advance_count" : 0,
                          "set_min_competitive_score" : 0,
                          "next_doc" : 0,
                          "match" : 0,
                          "next_doc_count" : 0,
                          "score_count" : 0,
                          "compute_max_score_count" : 0,
                          "compute_max_score" : 0,
                          "advance" : 526480,
                          "advance_count" : 102,
                          "score" : 0,
                          "build_scorer_count" : 3,
                          "create_weight" : 2442,
                          "shallow_advance" : 0,
                          "create_weight_count" : 1,
                          "build_scorer" : 11408
                        }
                      }
                    ]
                  }
                ]
              }
            ],
            "rewrite_time" : 91333,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 502790
              }
            ]
          }
        ],
        "aggregations" : [ ],
        "fetch" : {
          "type" : "fetch",
          "description" : "",
          "time_in_nanos" : 25039205,
          "breakdown" : {
            "load_stored_fields" : 7341896,
            "load_stored_fields_count" : 101,
            "next_reader" : 18375,
            "next_reader_count" : 1
          },
          "debug" : {
            "stored_fields" : [
              "_id",
              "_routing",
              "_source"
            ]
          },
          "children" : [
            {
              "type" : "FetchSourcePhase",
              "description" : "",
              "time_in_nanos" : 58257,
              "breakdown" : {
                "process_count" : 101,
                "process" : 54538,
                "next_reader" : 3719,
                "next_reader_count" : 1
              },
              "debug" : {
                "fast_path" : 101
              }
            }
          ]
        }
      },
      {
        "id" : "[gIa422qtRymC2yCgUX0BRg][dns-2022.01.31][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "#@timestamp:[1643604891852 TO 1643634378873] #source_ip:{10.15.54.82} #destination_ip:{10.11.100.100} #ConstantScore(query:update.googleapis.com) #ConstantScore(sensor:ids10)",
                "time_in_nanos" : 260444552,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 123,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 1808261,
                  "match" : 305415,
                  "next_doc_count" : 123,
                  "score_count" : 123,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 74258,
                  "advance_count" : 1,
                  "score" : 39799,
                  "build_scorer_count" : 2,
                  "create_weight" : 676729,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 257540090
                },
                "children" : [
                  {
                    "type" : "IndexOrDocValuesQuery",
                    "description" : "@timestamp:[1643604891852 TO 1643634378873]",
                    "time_in_nanos" : 321341,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 123,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 198400,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 38862,
                      "advance_count" : 124,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 4536,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 79543
                    }
                  },
                  {
                    "type" : "PointInSetQuery",
                    "description" : "source_ip:{10.15.54.82}",
                    "time_in_nanos" : 219884,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 43394,
                      "match" : 0,
                      "next_doc_count" : 123,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 62331,
                      "advance_count" : 215,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 1492,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 112667
                    }
                  },
                  {
                    "type" : "PointInSetQuery",
                    "description" : "destination_ip:{10.11.100.100}",
                    "time_in_nanos" : 254418599,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 45270,
                      "advance_count" : 124,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 547,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 254372782
                    }
                  },
                  {
                    "type" : "ConstantScoreQuery",
                    "description" : "ConstantScore(query:update.googleapis.com)",
                    "time_in_nanos" : 813078,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 691533,
                      "advance_count" : 338,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 56703,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 64842
                    },
                    "children" : [
                      {
                        "type" : "TermQuery",
                        "description" : "query:update.googleapis.com",
                        "time_in_nanos" : 471987,
                        "breakdown" : {
                          "set_min_competitive_score_count" : 0,
                          "match_count" : 0,
                          "shallow_advance_count" : 0,
                          "set_min_competitive_score" : 0,
                          "next_doc" : 0,
                          "match" : 0,
                          "next_doc_count" : 0,
                          "score_count" : 0,
                          "compute_max_score_count" : 0,
                          "compute_max_score" : 0,
                          "advance" : 402021,
                          "advance_count" : 338,
                          "score" : 0,
                          "build_scorer_count" : 3,
                          "create_weight" : 9690,
                          "shallow_advance" : 0,
                          "create_weight_count" : 1,
                          "build_scorer" : 60276
                        }
                      }
                    ]
                  },
                  {
                    "type" : "ConstantScoreQuery",
                    "description" : "ConstantScore(sensor:ids10)",
                    "time_in_nanos" : 385064,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 361817,
                      "advance_count" : 124,
                      "score" : 0,
                      "build_scorer_count" : 3,
                      "create_weight" : 9280,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 13967
                    },
                    "children" : [
                      {
                        "type" : "TermQuery",
                        "description" : "sensor:ids10",
                        "time_in_nanos" : 309240,
                        "breakdown" : {
                          "set_min_competitive_score_count" : 0,
                          "match_count" : 0,
                          "shallow_advance_count" : 0,
                          "set_min_competitive_score" : 0,
                          "next_doc" : 0,
                          "match" : 0,
                          "next_doc_count" : 0,
                          "score_count" : 0,
                          "compute_max_score_count" : 0,
                          "compute_max_score" : 0,
                          "advance" : 296305,
                          "advance_count" : 124,
                          "score" : 0,
                          "build_scorer_count" : 3,
                          "create_weight" : 2716,
                          "shallow_advance" : 0,
                          "create_weight_count" : 1,
                          "build_scorer" : 10219
                        }
                      }
                    ]
                  }
                ]
              }
            ],
            "rewrite_time" : 103901,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 419100
              }
            ]
          }
        ],
        "aggregations" : [ ],
        "fetch" : {
          "type" : "fetch",
          "description" : "",
          "time_in_nanos" : 11602835,
          "breakdown" : {
            "load_stored_fields" : 7225743,
            "load_stored_fields_count" : 123,
            "next_reader" : 27007,
            "next_reader_count" : 1
          },
          "debug" : {
            "stored_fields" : [
              "_id",
              "_routing",
              "_source"
            ]
          },
          "children" : [
            {
              "type" : "FetchSourcePhase",
              "description" : "",
              "time_in_nanos" : 82529,
              "breakdown" : {
                "process_count" : 123,
                "process" : 78512,
                "next_reader" : 4017,
                "next_reader_count" : 1
              },
              "debug" : {
                "fast_path" : 123
              }
            }
          ]
        }
      }
    ]
  }

OK I tried running the query in a loop. It consistently took 0.3-0.5 seconds (never lower) and I tried to capture hot threads for one of the nodes. I stopped existing search programs so hopefully data does correspond to the same search

::: {esnode2}{7VyrRwY7RsaNsG9Ku_4mHA}{Bg1ZBUWGQNila-hAPL7FBQ}{10.44.0.47}{10.44.0.47:9201}{dirt}{xpack.installed=true, transform.node=true}
   Hot threads at 2022-02-08T09:45:55.681Z, interval=100ms, busiestThreads=3, ignoreIdleThreads=true:
   
   100.7% [cpu=100.7%, other=0.0%] (100.7ms out of 100ms) cpu usage by thread 'elasticsearch[esnode2][search][T#3]'
     2/10 snapshots sharing following 35 elements
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:674)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:674)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:362)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitablePointValues.intersect(ExitableDirectoryReader.java:192)
       app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:154)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:148)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:750)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:158)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:182)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:330)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:244)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:191)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:458)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$5947/0x000000080196cb08.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$5948/0x000000080196cd30.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5949/0x000000080196cf58.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.1/java.lang.Thread.run(Thread.java:833)
     8/10 snapshots sharing following 13 elements
       java.base@17.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@17.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
       java.base@17.0.1/java.util.concurrent.LinkedTransferQueue$Node.block(LinkedTransferQueue.java:470)
       java.base@17.0.1/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3463)
       java.base@17.0.1/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3434)
       java.base@17.0.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:669)
       java.base@17.0.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
       java.base@17.0.1/java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1286)
       app//org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:152)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1062)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.1/java.lang.Thread.run(Thread.java:833)

::: {esnode2}{7VyrRwY7RsaNsG9Ku_4mHA}{Bg1ZBUWGQNila-hAPL7FBQ}{10.44.0.47}{10.44.0.47:9201}{dirt}{xpack.installed=true, transform.node=true}
   Hot threads at 2022-02-08T09:45:56.398Z, interval=100ms, busiestThreads=3, ignoreIdleThreads=true:
      
   100.0% [cpu=97.7%, other=2.3%] (100ms out of 100ms) cpu usage by thread 'elasticsearch[esnode2][search][T#7]'
     2/10 snapshots sharing following 38 elements
       app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:395)
       app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:395)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:629)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:674)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:362)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitablePointValues.intersect(ExitableDirectoryReader.java:192)
       app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:154)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:148)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:750)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:158)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:182)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:330)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:244)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:191)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:458)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$5947/0x000000080196cb08.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$5948/0x000000080196cd30.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5949/0x000000080196cf58.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.1/java.lang.Thread.run(Thread.java:833)
     3/10 snapshots sharing following 37 elements
       app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:391)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:629)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:674)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:362)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitablePointValues.intersect(ExitableDirectoryReader.java:192)
       app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:154)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:148)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:750)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:158)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:182)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:330)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:244)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:191)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:458)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$5947/0x000000080196cb08.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$5948/0x000000080196cd30.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5949/0x000000080196cf58.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.1/java.lang.Thread.run(Thread.java:833)
     2/10 snapshots sharing following 38 elements
       app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:391)
       app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:395)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:629)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:674)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:362)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitablePointValues.intersect(ExitableDirectoryReader.java:192)
       app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:154)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:148)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:750)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:158)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:182)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:330)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:244)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:191)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:458)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$5947/0x000000080196cb08.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$5948/0x000000080196cd30.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5949/0x000000080196cf58.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.1/java.lang.Thread.run(Thread.java:833)
     3/10 snapshots sharing following 34 elements
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:674)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:664)
       app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:362)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitablePointValues.intersect(ExitableDirectoryReader.java:192)
       app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:154)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:148)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:750)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:158)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:182)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:330)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:244)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:191)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:458)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$5947/0x000000080196cb08.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$5948/0x000000080196cd30.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5949/0x000000080196cf58.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.1/java.lang.Thread.run(Thread.java:833)

Thanks again

I don't see anything strange in the profile or in the hot threads. What we see is consistent and we are just spending a lot of time visiting the BKD tree. There has been no work on that area so we think it is something more subtle that is exposing a pre-existing performance bug.

The change in Lucene that is exposing the situation is this one [LUCENE-10043] Decrease default for LRUQueryCache#skipCacheFactor? - ASF JIRA where we are less aggressive caching small segments.

There is an undocumented setting that we use for testing that can help confirming our suspicion. Could you set the setting indices.queries.cache.all_segments to true? Just to be clear that we recommend unsetting it once it is confirmed the issue, you should not run this setting in a production environment.

Let us know the result.

Thanks for taking time to investigate this

Is this setting to be done on per node basis using elasticsearch.yml? Should I set this on all nodes, because once I restart one node, it is not guaranteed that it will retain shards I am interested in?

I am afraid is a static setting so it is per node basis :frowning:

Anyway, I have already open an issue in the Elasticsearch repo that should help in this case: https://github.com/elastic/elasticsearch/issues/83658.

Ah in that case I won't be able to tinker with production setup.

I have just discovered that the Elasticsearch was updated at December end and this issue I am facing is more recent. There have been no changes in infrastructure in that time although we have had frequent weekend shutdowns. Thus, it is possible that the search was always at same speed and my program could make up the lag on weekends. Now with the systems off on alternate weekends, it is unable to make up the lag.

To test my theory, I have prepared identical single node cluster for both 7.15 and 7.16 and am dumping this index to both the clusters. Once complete, I will run the same search to can get a relative comparison of search speeds.

If, as I am now expecting, there is not much difference, then this is not an Elasticsearch issue and I apologize for wasting everybody's time.

I will be looking forward to hearing your conclusions.

It was not a waste time, I think the issue I open still stands and there is room of improvement here :slight_smile:

I dumped the index into two identical single node clusters (Both VMs with 32GB RAM and 4 cores with identical memory settings for Elasticsearch and hosted on same physical machine and using same physical network port).

Turns out queries against 7.16.1 are actually faster by a margin of about 25% when compared to 7.15.1.

ES-7.15.1 - Average query time - 0.271 seconds
ES-7.16.1 - Average query time - 0.201 seconds

So this issue is something wrong at my end and not related to Elasticsearch at all.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.