Run match query over specific documents

I have a large index (10M+ docs with large bodies of text) and I'm trying to speed up queries for large bodies of text over it.

I know in advance in which documents the "text" is likely to be (i.e. I know the doc's likely "title"), so ideally I'd like the match query to run only over the documents that have the specified title, as that would likely be much faster than matching against the whole index.

Is this a viable approach?

I've already tried combinations of must/should and must/must with terms and match queries, as well as a filter over the title and over an ids query, but the match query's run time is unchanged; it appears that filtering is applied to the results of the match rather than before it.

My expectation was that the terms query/filter, which in theory has the lowest cost, would be evaluated first, and that only the matching docs would then be scanned by the match query. At least, that was my understanding of the docs ("the goal of filtering is to reduce the number of documents that have to be examined").

Mappings:

{
  "myIndex": {
    "mappings": {
      "page": {
        "_all": {
          "enabled": false
        },
        "properties": {
          "text": {
            "type": "text",
            "analyzer": "myAnalyzer"
          },
          "title": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Queries tried so far:

{
  "query": {
    "bool": {
      "must": {
        "terms": {
          "title": ["t1", "t2", "t3"]
        }
      },
      "should": { // also tried a "must" query here
        "match": {
          "text": "large body of text here"
        }
      }
    }
  }
}

{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "title": ["t1", "t2", "t3"]
        }
      },
      "should": {
        "match": {
          "text": "large body of text here"
        }
      }
    }
  }
}

{
  "query": {
    "bool": {
      "filter": {
        "ids": {
          "values": ["33934108", "1196927", "2235504"]
        }
      },
      "should": {
        "match": {
          "text": "large body of text here"
        }
      }
    }
  }
}

The match query is typically very fast, and 10M documents is not necessarily much data by Elasticsearch standards, so I'm not sure there's much optimization you need to do, or even can do, here.

Having said that, what you're trying to do can be achieved using rescoring. For example, your second query could be rewritten as:

{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "title": [
            "t1",
            "t2",
            "t3"
          ]
        }
      }
    }
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "match": {
          "text": "large body of text here"
        }
      }
    }
  }
}

The idea here is that the match query will now be executed only on the top 10 documents matching the terms query, on each shard. You'll want to make sure window_size is larger than the number of terms in that first query.


Thanks for the suggestion. I tested your query: the filter part runs in 200 ms, but the rescore/match query still takes 41 minutes to complete, even when filtering down to a single doc.

I set the filter to a single item so that the rescore/match query runs over one specific document in which there are a few hits. The "text" field is shingled, so this doc's term vector contains ~1,700 tokens. The "large body of text here" in the match query is analyzed into about 180k shingles/terms and (according to the profiler) rewritten into a BooleanQuery composed of one TermQuery per token/shingle; I had to raise indices.query.bool.max_clause_count to 1024000 for this test to run at all.
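For readers wanting to reproduce this: in ES 5.x, indices.query.bool.max_clause_count is (to my understanding) a static node-level setting, so it goes in elasticsearch.yml rather than the cluster settings API, e.g.:

```yaml
# elasticsearch.yml -- static setting, takes effect after a node restart
indices.query.bool.max_clause_count: 1024000
```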

So, to test whether the problem was solely due to the rescore/match query itself (i.e. because the input text is so large), I created a new index (same mappings and analyzers) containing only the one document previously used in the filter. Against this index, the exact same rescore/match query runs in 5 seconds.

I'm a bit confused about whether this slowness is related:

  1. to the size of the input text, although the query runs in 5 seconds with the same large input on an index with only 1 document
  2. or to the filter somehow not being applied, with the match query running against more documents than specified, since on the index with 10M+ docs the query takes 41 minutes. That said, iostat shows only one of the two disks configured for path.data is active during rescoring, which suggests only some shards are involved, so the match query does appear to be restricted by the filter despite the poor performance.

PS: I'm running ES 5.6.4 on a single machine using a 4GB heap (usage is below 50%), ~10GB RAM kept free for filesystem cache and 2xSSD each capable of 250MB/s.

To understand what's going on, it would be good to see the response of the profile API. Can you share those responses for the query on both indexes? (You may have to create a gist, as the response size may exceed what you can post here.)
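Profiling can be enabled by adding "profile": true to the search body; the index name and query below are just placeholders matching the earlier examples:

```json
GET myIndex/_search
{
  "profile": true,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "title": ["t1", "t2", "t3"]
        }
      }
    }
  }
}
```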

When you try to use the _analyze API and analyze your query string, does it return the same output for both the indexes? Does it take an equal amount of time to return?
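A request along these lines should work, assuming the analyzer and index names from the mapping shown earlier:

```json
GET myIndex/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "large body of text here"
}
```

Comparing the token count and timing across the two indexes would help rule out analysis as the bottleneck.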


Apologies for the delay, here is the info you requested:

When you try to use the _analyze API and analyze your query string, does it return the same output for both the indexes? Does it take an equal amount of time to return?

Yes, on both indexes it takes ~4 seconds and returns exactly the same output.

it would be good to see the response of the profile API

I had to upload the output of _profile to GitLab as both files have ~200MB of prettified JSON.

The same rescore/match for "large body of text" query was run over an index with ~10M documents (output in the profileLargeIndex.json file) and over an index with 1 document (profileSmallIndex.json file). The index with 10M documents took ~3387 seconds whereas the index with 1 doc took ~5 seconds.

You can access both files at https://gitlab.com/jgpt/test1/tree/master/test1

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.