I have documents with the same IDs (GUIDs) split across indices in my system. When a search is run across indices, I join them in my client code. This is done using IDs queries (see "IDs" in the Elasticsearch Guide [8.13]).
The indices sometimes hold millions of documents and can exceed 50 GB each, but there are 5 shards per index, multiple nodes, and 31 GB of RAM dedicated to the JVMs.
The IDs queries sometimes contain hundreds of thousands of GUIDs.
Using the query profiler in Kibana, I found that build_scorer in TermInSetQuery accounted for 99-100% of the time in each search request. I don't need scoring. I saw in "Sort search results" (Elasticsearch Guide [8.13]) that scoring can be disabled by adding a sort, and I saw that some people achieve this by sorting on _doc. Adding this doesn't change performance or what the profiler tells me, though: build_scorer is still the bottleneck.
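For reference, a request of the kind described above might look like the sketch below. It only builds the request body as a plain dict, so it runs without a cluster. The function name, the field name "title", and the default size are made up for illustration; wrapping the IDs query in a bool filter and turning off track_total_hits are my additions, not something the original request is known to contain, and I can't promise either changes what the profiler reports.

```python
import json


def build_ids_request(ids, source_fields, size=1000):
    """Build a search body that matches documents by _id without asking for scores.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return {
        "size": size,
        # Skip exact hit counting; only the matching documents are needed.
        "track_total_hits": False,
        "_source": source_fields,
        "query": {
            "bool": {
                # Filter context: clauses here do not contribute to the score.
                "filter": [
                    {"ids": {"values": list(ids)}}
                ]
            }
        },
        # Sorting on _doc (index order) means results are not ranked by _score.
        "sort": ["_doc"],
    }


body = build_ids_request(["guid-1", "guid-2"], ["title"])
print(json.dumps(body, indent=2))
```

As far as I understand the profile API documentation, build_scorer covers constructing the iterator over matching documents as well as any scoring setup, so it can still show up when scores are not requested.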
I saw TermInSetQuery seems to be a Lucene thing, so perhaps sorting on the request level in Elasticsearch doesn't affect what's going on in Lucene?
As I could have 100k document IDs, I think executing 100k GET requests would be worse than waiting the roughly 10 seconds it takes currently. What I don't understand is why it says it's computing the score when I'm telling it not to.
Are you requesting 100K IDs in the request? I suspect the score is a red herring and it's simply the number of IDs you're requesting at once that's causing your performance bottleneck.
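One way to test whether the sheer number of IDs is the problem is to split the list into smaller batches and time each request separately. The sketch below only builds the request bodies; the helper names and the batch size of 10,000 are assumptions picked for illustration, not recommended values. Note that results sorted within each batch would still need to be merged in client code.

```python
def chunked(ids, batch_size):
    """Yield consecutive slices of ids, each at most batch_size long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def build_batched_requests(ids, batch_size=10_000):
    """Build one IDs-query body per batch (hypothetical helper)."""
    return [
        {
            # Ask for every document in the batch.
            "size": len(batch),
            "query": {"ids": {"values": batch}},
            "sort": ["_doc"],
        }
        for batch in chunked(ids, batch_size)
    ]


# Placeholder IDs standing in for real GUIDs.
ids = [f"guid-{i}" for i in range(100_000)]
requests = build_batched_requests(ids)
print(len(requests))  # 10
```

Timing a single 10,000-ID batch against the full 100,000-ID request should show whether the cost grows with the number of IDs or comes from something else.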
I am indeed requesting 100k IDs in the IDs query. I would hope this wouldn't be a problem, as that's only 3.6 MB of data. I use an IDs query for the documents in the other index so I can apply sorting and get source fields from the main index.
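The 3.6 MB figure matches a quick back-of-the-envelope check, assuming each GUID is sent in the standard 36-character hyphenated form (an assumption; the exact ID format isn't stated here). The variable names are illustrative.

```python
import uuid

# A GUID in canonical text form: 32 hex digits plus 4 hyphens.
guid = str(uuid.uuid4())

id_count = 100_000
payload_bytes = id_count * len(guid.encode("utf-8"))

print(len(guid), payload_bytes / 1_000_000)  # 36 3.6
```

The JSON quotes and separators around each ID add a few more bytes per entry, so the actual request body would be slightly larger.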
If it's a red herring, then that's very unfortunate, because I don't know how to fix the problem.
If it's a bug in how the Lucene method is being used, I was thinking about forking Elasticsearch, investigating, and making a PR to fix it, or at least opening an issue on the Elasticsearch GitHub.