Selecting documents to search (performance issues)

I have an elasticsearch document repository with ~15M documents.

Each document has a unique 11-character string field (originating in a MongoDB collection) that identifies it. This field is indexed as keyword.

I'm using C#.

When I run a search, I want to be able to limit the search to a set of documents that I specify (via some list of the unique field ids).

My query uses a bool with must clauses: a filter on the unique identifiers, plus additional clauses to actually search the documents. See the example below.

To search a large number of documents, I generate multiple query strings and run them concurrently. Each query handles up to 64K unique ids (determined by the limit on the number of terms in a terms query).

In this case, I have 262,144 documents to search (the list comes, at run time, from a separate MongoDB query). So my code generates 4 query strings (see example below).

I run them concurrently.
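The batching-and-dispatch described above can be sketched as follows. This is a minimal illustration, not the poster's actual code (which is C#): `search_fn` is a hypothetical placeholder for the client call that executes one query, and the query body mirrors the example further down.

```python
from concurrent.futures import ThreadPoolExecutor

TERMS_LIMIT = 65_536  # default cap on the number of values in a terms query

def chunk_ids(ids, limit=TERMS_LIMIT):
    """Split the id list into batches no larger than the terms-query limit."""
    return [ids[i:i + limit] for i in range(0, len(ids), limit)]

def build_query(id_batch):
    """Build one bool query: full-text clauses plus a terms filter
    restricting the search to this batch of talentId values."""
    return {
        "_source": "talentId",
        "from": 0,
        "size": 10_000,
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"freeText": "java"}},
                    {"match_phrase": {"freeText": "unix"}},
                ],
                "filter": {"terms": {"talentId": id_batch}},
            }
        },
    }

def run_concurrently(ids, search_fn):
    """Run one query per batch in parallel; search_fn executes a single query."""
    batches = chunk_ids(ids)
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        return list(pool.map(search_fn, (build_query(b) for b in batches)))
```

With 262,144 ids and the 65,536-term limit, `chunk_ids` produces exactly 4 batches, matching the 4 query strings described above.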

Unfortunately, this search takes over 22 seconds to complete.

When I run the same search but drop the terms node (so it searches all 15M documents), a single query completes in 1.8 seconds.

An incredible difference.

So my question: Is there an efficient way to specify which documents are to be searched (when each document has a unique self-identifying keyword field)?

I want to be able to specify up to a few hundred thousand such unique ids.

Here's an example of my search specifying unique document identifiers:

{
    "_source" : "talentId",
    "from" : 0,
    "size" : 10000,
    "query" : {
        "bool" : {
            "must" : [
                {
                    "bool" : {
                        "must" : [  {  "match_phrase" : { "freeText" : "java" } },
                                          {  "match_phrase" : { "freeText" : "unix" } },
                                          {  "match_phrase" : { "freeText" : "c#" } },
                                          {  "match_phrase" : { "freeText" : "cnn" } }    ]
                    }
                },
                {
                    "bool" : {
                        "filter" : {
                            "bool" : {
                                "should" : [
                                    {
                                        "terms" : {
                                            "talentId" : [ "goGSXMWE1Qg",  "GvTDYS6F1Qg",
                                                           "-qa_N-aC1Qg", "iu299LCC1Qg",
                                                           "0p7SpteI1Qg",  ... 4,995 more ...  ]
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

How are you finding these ids? Is there some grouping you could use instead to reduce the number of terms? As far as I know, there is no magic way to dramatically improve the performance of this type of query.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.