Selecting documents to search (performance issues)

I have an elasticsearch document repository with ~15M documents.

Each document has a unique 11-character string field (originating in a MongoDB collection) that identifies it. This field is indexed as keyword.

I'm using C#.

When I run a search, I want to be able to limit the search to a set of documents that I specify (via some list of the unique field ids).

My query uses a bool with must clauses: a filter on the unique identifiers, plus additional clauses to actually search the documents. See the example below.

To search a large number of documents, I generate multiple query strings and run them concurrently. Each query handles up to 64K unique ids (determined by the limit on the number of terms in a terms query).

In this case, I have 262,144 documents to search (the list comes, at run time, from a separate MongoDB query). So my code generates 4 query strings (see example below).

I run them concurrently.
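The batching-and-dispatch described above can be sketched as follows. This is a minimal illustration, not the poster's actual code (which is C#): `search_fn` is a hypothetical placeholder for the client call that executes one query, and the query body mirrors the example further down.

```python
from concurrent.futures import ThreadPoolExecutor

TERMS_LIMIT = 65_536  # default cap on the number of values in a terms query

def chunk_ids(ids, limit=TERMS_LIMIT):
    """Split the id list into batches no larger than the terms-query limit."""
    return [ids[i:i + limit] for i in range(0, len(ids), limit)]

def build_query(id_batch):
    """Build one bool query: full-text clauses plus a terms filter
    restricting the search to this batch of talentId values."""
    return {
        "_source": "talentId",
        "from": 0,
        "size": 10_000,
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"freeText": "java"}},
                    {"match_phrase": {"freeText": "unix"}},
                ],
                "filter": {"terms": {"talentId": id_batch}},
            }
        },
    }

def run_concurrently(ids, search_fn):
    """Run one query per batch in parallel; search_fn executes a single query."""
    batches = chunk_ids(ids)
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        return list(pool.map(search_fn, (build_query(b) for b in batches)))
```

With 262,144 ids and the 65,536-term limit, `chunk_ids` produces exactly 4 batches, matching the 4 query strings described above.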

Unfortunately, this search takes over 22 seconds to complete.

When I run the same search but drop the terms node (so it searches all 15M documents), a single query completes in 1.8 seconds.

An incredible difference.

So my question: Is there an efficient way to specify which documents are to be searched (when each document has a unique self-identifying keyword field)?

I want to be able to specify up to a few hundred thousand such unique ids.

Here's an example of my search specifying unique document identifiers:

{
    "_source" : "talentId",
    "from" : 0,
    "size" : 10000,
    "query" : {
        "bool" : {
            "must" : [
                {
                    "bool" : {
                        "must" : [  {  "match_phrase" : { "freeText" : "java" } },
                                          {  "match_phrase" : { "freeText" : "unix" } },
                                          {  "match_phrase" : { "freeText" : "c#" } },
                                          {  "match_phrase" : { "freeText" : "cnn" } }    ]
                    }
                },
                {
                    "bool" : {
                        "filter" : {
                            "bool" : {
                                "should" : [
                                    {
                                        "terms" : {
                                            "talentId" : [ "goGSXMWE1Qg",  "GvTDYS6F1Qg",
                                                           "-qa_N-aC1Qg", "iu299LCC1Qg",
                                                           "0p7SpteI1Qg",  ... 4,995 more ...  ]
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

How are you finding these ids? Is there some grouping you could use instead to reduce the number of terms? As far as I know, there is no magic way to dramatically improve the performance of this type of query.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.