IDs Query Performance Problems

yottabyte · May 13, 2024, 3:10pm

Hello,

I have documents that share the same IDs (GUIDs) split across indices in my system. When a search runs across indices, I join the results in my client code. This is done using IDs queries: IDs | Elasticsearch Guide [8.13] | Elastic

The indices hold millions of documents at times and are sometimes over 50 GB each, but there are 5 shards per index, multiple nodes, and 31 GB of RAM dedicated to the JVMs.

The IDs queries sometimes contain hundreds of thousands of GUIDs.

Using the query profiler in Kibana, I found that build_scorer in TermInSetQuery accounts for 99-100% of the time in each search request. I don't need scoring. I saw here: Sort search results | Elasticsearch Guide [8.13] | Elastic that scoring can be disabled by adding a sort, and I saw that some people achieve this by sorting on _doc, but adding that changes neither the performance nor what the profiler reports. build_scorer is still the bottleneck.

TermInSetQuery appears to be a Lucene class, so perhaps sorting at the request level in Elasticsearch doesn't affect what happens inside Lucene?

I also tried a constant score query (Constant score query | Elasticsearch Guide [8.13] | Elastic), and filters in general, for the IDs query, with no luck. The profiler breakdown still shows the score being computed.
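
For reference, the request shapes described in this post can be sketched as plain search bodies. This is only an illustration, assuming the GUIDs are the documents' _id values. The helper names are hypothetical, and nothing here is sent to a cluster; the code only builds the JSON that would go into a _search request.

```python
import json


def ids_query(ids):
    """Plain IDs query on the given document IDs."""
    return {"query": {"ids": {"values": list(ids)}}}


def ids_query_sorted_by_doc(ids):
    """Same IDs query with a request-level sort on _doc."""
    body = ids_query(ids)
    body["sort"] = ["_doc"]
    return body


def ids_query_constant_score(ids):
    """IDs query wrapped in a constant_score filter."""
    return {
        "query": {
            "constant_score": {
                "filter": {"ids": {"values": list(ids)}}
            }
        }
    }


def with_profile(body):
    """Return a copy of the body with the search profiler switched on."""
    return {**body, "profile": True}


if __name__ == "__main__":
    sample_ids = ["abc", "def"]  # placeholder IDs, not real GUIDs
    for build in (ids_query, ids_query_sorted_by_doc, ids_query_constant_score):
        print(json.dumps(with_profile(build(sample_ids)), indent=2))
```

Running each variant with "profile": true is how the build_scorer timings for the three shapes could be compared side by side.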

Kathleen_DeRusso · May 13, 2024, 3:33pm

Scoring shouldn't be too expensive to run, but you've listed the options available at search time. For example:

PUT score_test/_doc/abc
{
  "field": "foo"
}
PUT score_test/_doc/def
{
  "field": "bar"
}
PUT score_test/_doc/ghi
{
  "field": "baz"
}

GET score_test/_search
{
  "query": {
    "ids": {
      "values": ["abc", "def" ]
    }
  }
}


GET score_test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "ids": {
            "values": [
              "abc",
              "def"
            ]
          }
        }
      ]
    }
  }
}

You could also consider using the get API for each individual doc ID. That retrieves the document directly rather than returning search results.
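
As a rough sketch of what that would involve, the snippet below builds the get API path for each ID, using the score_test index from the example above. The helper names are hypothetical, and the code only constructs paths; it does not contact a cluster.

```python
from urllib.parse import quote


def get_doc_path(index, doc_id):
    """Path of the get API request for a single document ID."""
    return f"/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"


def get_doc_paths(index, doc_ids):
    """One get request path per document ID."""
    return [get_doc_path(index, doc_id) for doc_id in doc_ids]


if __name__ == "__main__":
    for path in get_doc_paths("score_test", ["abc", "def"]):
        print("GET", path)
```

Note that this produces one request per ID, so the number of round trips grows linearly with the size of the ID list.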

yottabyte · May 13, 2024, 3:36pm

Since I could have 100k document IDs, I think executing 100k GET requests would be worse than the ~10 seconds I currently wait. What I don't understand is why it says it's computing the score when I'm telling it not to.

Kathleen_DeRusso · May 13, 2024, 4:00pm

Are you requesting 100K IDs in the request? I suspect the score is a red herring and it's simply the number of IDs you're requesting at once that's causing your performance bottleneck.
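
One way to check whether the number of IDs, rather than scoring, is the bottleneck would be to split the list into smaller batches and compare the timings. Below is a minimal sketch that only builds the request bodies. The batch size of 10,000 is picked purely for illustration, the helper names are hypothetical, and nothing in this thread establishes that batching actually helps.

```python
def chunked(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def batched_filter_bodies(ids, size):
    """One bool/filter IDs query body per batch of IDs."""
    return [
        {"query": {"bool": {"filter": [{"ids": {"values": batch}}]}}}
        for batch in chunked(ids, size)
    ]


if __name__ == "__main__":
    # Placeholder IDs standing in for real GUIDs.
    all_ids = [f"id-{n}" for n in range(100_000)]
    bodies = batched_filter_bodies(all_ids, 10_000)
    print(len(bodies), "requests of up to 10,000 IDs each")
```

Each body would be sent as a separate search, and any sorting across batches would then have to be redone on the client.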

yottabyte · May 13, 2024, 4:36pm

I am indeed requesting 100k IDs in the IDs query. I would hope this wouldn't be a problem, as that's only about 3.6 MB of data. I use an IDs query for the documents in the other index so I can apply sorting and get source fields from the main index.
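
The size estimate can be reproduced with a little arithmetic, assuming each ID is a GUID in its usual 36-character text form (32 hex digits plus 4 hyphens). The GUIDs below are randomly generated stand-ins, not real document IDs.

```python
import json
import uuid

GUID_TEXT_LENGTH = 36  # 32 hex digits + 4 hyphens


def raw_id_bytes(count):
    """Bytes needed for `count` GUID strings, ignoring JSON syntax."""
    return count * GUID_TEXT_LENGTH


def json_array_bytes(ids):
    """Bytes of the IDs serialized as a compact JSON array."""
    return len(json.dumps(list(ids), separators=(",", ":")).encode("utf-8"))


if __name__ == "__main__":
    ids = [str(uuid.uuid4()) for _ in range(100_000)]
    print(raw_id_bytes(len(ids)) / 1_000_000, "MB of raw ID text")
    print(json_array_bytes(ids) / 1_000_000, "MB as a compact JSON array")
```

The raw text comes to 3.6 MB; the quotes and commas of the JSON array add a few hundred kilobytes on top of that.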

If it's a red herring, then that's very unfortunate, because I don't know how to fix the problem.

If it's a bug in how the Lucene method is being used, I was thinking about forking Elasticsearch, investigating, and making a PR to fix it, or at least opening an issue on the Elasticsearch GitHub.

system · June 10, 2024, 4:36pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
