Improving search query time

Hi,

Setup: Elasticsearch 6.3.0, 2 shards, 4 replicas on 5 dedicated data nodes.
I have a products feed index with 10,000,000 documents - each document represents a product and looks like:

    {
        "id": 123,
        "active": true,
        "popularity": 51,
        "url": "http://mysite.com/new-iPhone-12-pro-max" 
    }

A typical query is to find all products whose URL contains the substring "iPhone".
I know that the regexp query is considered slow, but this is what we use in production (with a lowercase tokenizer), and unfortunately it is slow (about 1.5s).
Now, I have a known fact that I want to use to improve that: half of the products have a "popularity" value that is lte 0.
My first assumption is that filtering by "range" will improve the "took" - is that true?
In addition, I have added the "active" boolean field that represents the rule above (true if "popularity" > 0).
My second assumption is that filtering by "term" will perform better than "range" - is that true?

Here is the first query (from slowlog):

    {
      "from": 0,
      "size": 9,
      "query": {
        "bool": {
          "must": [
            {
              "function_score": {
                "query": {
                  "match_all": {
                    "boost": 1
                  }
                },
                "functions": [
                  {
                    "filter": {
                      "match_all": {
                        "boost": 1
                      }
                    },
                    "field_value_factor": {
                      "field": "popularity",
                      "factor": 1,
                      "modifier": "none"
                    }
                  }
                ],
                "score_mode": "multiply",
                "max_boost": 3.4028235e+38,
                "boost": 1
              }
            }
          ],
          "filter": [
            {
              "term": {
                "active": {
                  "value": true,
                  "boost": 1
                }
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "regexp": {
                      "url": {
                        "value": ".*iphone.*",
                        "flags_value": 65535,
                        "max_determinized_states": 10000,
                        "boost": 1
                      }
                    }
                  }
                ],
                "adjust_pure_negative": true,
                "boost": 1
              }
            }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      },
      "_source": {
        "includes": [],
        "excludes": [
          "active*",
          "popularity*"
        ]
      }
    }

I will add the range query as well if you think my second assumption is wrong (i.e. that "term" will perform the same as "range").
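For reference, the range filter I have in mind would roughly replace the term filter with this clause (just a sketch):

    {
      "range": {
        "popularity": {
          "gt": 0
        }
      }
    }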

I would appreciate any help!

Hi Itay.
Adding other mandatory clauses like a popularity filter can help speed up retrieval of the other clauses' matches only if those other clauses are common terms in the index. For example, a clause on the 'active:true' term might match millions of documents, but we can quickly skip over most of those matching Lucene document IDs if another mandatory clause tells us the first matching Lucene doc ID is several million documents into the index. We can skip over large sections of this list of "active" docs.

Your problem is different. Each indexed term is an entire URL and likely unique and so each term will have a list of only one matching document. Not much scope to accelerate scans in lists of document IDs that only hold one value.

The bulk of your time is spent scanning through the large list of unique terms (urls) looking to see if they contain “iPhone” before loading matching document IDs.
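One way to see where the time goes is to run the search with profiling turned on and look at the cost reported for the regexp clause - a sketch (index name here is just an example):

    GET products/_search
    {
      "profile": true,
      "size": 0,
      "query": {
        "regexp": {
          "url": ".*iphone.*"
        }
      }
    }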

To speed this up you have 2 options:

  1. Learn about tokenisation and index your URLs as 'text' fields so that search terms like "iphone" actually appear as terms in your index, or
  2. Upgrade to 7.9+ and use the new wildcard field.
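For example, a rough sketch of option 2 on a 7.9+ cluster (index name is illustrative):

    PUT products_wildcard
    {
      "mappings": {
        "properties": {
          "url": { "type": "wildcard" }
        }
      }
    }

    GET products_wildcard/_search
    {
      "query": {
        "wildcard": {
          "url": {
            "value": "*iPhone*"
          }
        }
      }
    }

The wildcard field indexes ngrams of the value internally and verifies candidate matches, so a query like this doesn't have to scan every unique term. Note the pattern above is case-sensitive as written; depending on your version you may want the case_insensitive option or to lowercase the values at index time.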

Thanks @Mark_Harwood!
I wasn't aware of the wildcard field - sounds interesting!
Also, it sounds like I could try something like the n-gram tokenizer instead of upgrading the cluster - does that make sense?
The URLs are unique - that's right; but I think I'm still missing a basic thing: for example, if 50% (out of 10M documents) have active: true and the other 50% have active: false, shouldn't I expect roughly half the took time (combined with the regex)? Doesn't the filtering happen before the regex?

BTW, is there a way to avoid caching so I can send the same query several times for timing? I tried adding request_cache=false as a request parameter but it doesn't seem to work.
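This is roughly how I'm sending it (index name is an example, and the body here is a simplified version of the query above):

    GET /products/_search?request_cache=false
    {
      "size": 9,
      "query": {
        "bool": {
          "filter": [
            { "term": { "active": true } },
            { "regexp": { "url": ".*iphone.*" } }
          ]
        }
      }
    }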

Think of the index like the one at the back of a book: an alphabetised list of terms, each listing the pages it occurs on in ascending order. That list of pages can be very long for common words, which is why books typically don't list words like "the". However, a search index is different and holds both rare and common words. The lists of pages matching a term include "skip" markers allowing the reader to avoid reading whole blocks of page numbers. That's why, when you match 2 mandatory terms, a rare term can let the reader skip over long sections of the common term's list.

Back to your main issue. Imagine trying to find all words in a book that contain the letter A (as opposed to starting with an A). The alphabetised ordering of the words is of no use and you have to do a linear scan of every index entry. Very expensive.

Re ngrams - you can use those in your mapping of text fields, but the advantage of the new wildcard field is that it will do that for you AND verify that character sequences longer than your ngram size are genuine matches.
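For completeness, a rough sketch of the ngram route on your current cluster (index name and mapping type are examples, and the gram sizes are not tuned values):

    PUT products_ngram
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "url_ngram_tokenizer": {
              "type": "ngram",
              "min_gram": 3,
              "max_gram": 3,
              "token_chars": [ "letter", "digit" ]
            }
          },
          "analyzer": {
            "url_ngrams": {
              "tokenizer": "url_ngram_tokenizer",
              "filter": [ "lowercase" ]
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "url": {
              "type": "text",
              "analyzer": "url_ngrams"
            }
          }
        }
      }
    }

    GET products_ngram/_search
    {
      "query": {
        "match": {
          "url": {
            "query": "iphone",
            "operator": "and"
          }
        }
      }
    }

With 3-grams, "iphone" is searched as iph/pho/hon/one, so a document only has to contain all of those grams somewhere in the URL to match - that verification of the full sequence is what the wildcard field does for you.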


Thanks again @Mark_Harwood!

Is this "block of page numbers" == within the Lucene index right?
Will it help if I force merge segments? or will it do the wrong impact in my case?
Can I use the fact that I'm "baking" the index in advance and then making it "read-only" for search somehow?
For instance, ensure that all active documents living in the same shard and non-active in the other one? Are those indexed terms per shards or per index?

Yes. Lucene uses "VInt" - variable-length int, gap-encoded numbers - for higher levels of compression in these long lists.

The phrase "reorganising the deck chairs on the Titanic" springs to mind.
Merging segments is like merging books. In your case the books share no words in common so the index of unique words doesn't get any shorter to scan.
The speed up will come from reorganising the index in one of the two ways I listed previously.

