Why is a trigram search on firstName so much faster than a trigram search on lastName?


(Mario Koehler) #1

We use ElasticSearch to index customer records (name + address) and in our queries we use trigram searches.

When i profiled some of the queries i noticed something peculiar that i don't understand, and i was hoping to get some insight.

Have a look at the profiler output below and check out the time spent on each of the trigram TermQueries

For example, the four TermQueries for the firstName "Sabine" all take around 208.000-244.000ns and thats more or less in line with all the other trigram queries in that search (fullName, street, city)...EXCEPT for lastName. Here the individual TermQueries take around 1.2-1.5 Million ns which is significantly slower and adds up to more than 24% of the total query time.

Initially i thought it might have something to do with the total amount of trigrams in each index. So because on average a first name might be shorter than a lastname, there will be less trigrams in the first name inverted index and it will be faster.

But that doesn't fit with the street field, which is on average probably the longest field and should then be the slowest. but it's not, it's in line (end even slightly faster) than the first name query.

by the way, there are around 15 million records in that index, so when i say "on average" then it should actually be a pretty good average to assume.

        7,2%       1.529.277ns ( 0,00s) BoostQuery (person.fullName.trigram:sab person.fullName.trigram:abi person.fullName.trigram:bin person.fullName.trigram:ine person.fullName.trigram:koe person.fullName.trigram:oen person.fullName.trigram:eni person.fullName.trigram:nig)^2.0
             11,7%         179.616ns ( 0,00s) TermQuery person.fullName.trigram:sab
              9,8%         149.382ns ( 0,00s) TermQuery person.fullName.trigram:abi
              9,5%         145.463ns ( 0,00s) TermQuery person.fullName.trigram:bin
              9,9%         151.973ns ( 0,00s) TermQuery person.fullName.trigram:ine
              8,6%         131.949ns ( 0,00s) TermQuery person.fullName.trigram:koe
              9,5%         144.911ns ( 0,00s) TermQuery person.fullName.trigram:oen
              9,3%         142.253ns ( 0,00s) TermQuery person.fullName.trigram:eni
              9,0%         138.348ns ( 0,00s) TermQuery person.fullName.trigram:nig
        6,3%       1.329.048ns ( 0,00s) SynonymQuery Synonym(person.firstName.phonetic:sYbini person.firstName.phonetic:sabin person.firstName.phonetic:sabini person.firstName.phonetic:savini person.firstName.phonetic:sobin person.firstName.phonetic:sobini person.firstName.phonetic:sovini person.firstName.phonetic:zYbini person.firstName.phonetic:zabini person.firstName.phonetic:zobini)
        5,7%       1.215.579ns ( 0,00s) BooleanQuery person.firstName.trigram:sab person.firstName.trigram:abi person.firstName.trigram:bin person.firstName.trigram:ine
             20,1%         244.919ns ( 0,00s) TermQuery person.firstName.trigram:sab
             17,6%         213.711ns ( 0,00s) TermQuery person.firstName.trigram:abi
             17,2%         208.705ns ( 0,00s) TermQuery person.firstName.trigram:bin
             18,1%         219.848ns ( 0,00s) TermQuery person.firstName.trigram:ine
        6,8%       1.436.954ns ( 0,00s) BoostQuery (Synonym(person.lastName.phonetic:kDnik person.lastName.phonetic:kYnik person.lastName.phonetic:koinik person.lastName.phonetic:kunik))^2.0
       24,3%       5.147.155ns ( 0,01s) BoostQuery (person.lastName.trigram:koe person.lastName.trigram:oen person.lastName.trigram:eni person.lastName.trigram:nig)^2.0
             29,3%       1.506.565ns ( 0,00s) TermQuery person.lastName.trigram:koe
             27,8%       1.432.001ns ( 0,00s) TermQuery person.lastName.trigram:oen
             25,9%       1.335.459ns ( 0,00s) TermQuery person.lastName.trigram:eni
             24,0%       1.233.435ns ( 0,00s) TermQuery person.lastName.trigram:nig
        2,7%         573.936ns ( 0,00s) BoostQuery (Synonym(address.street.phonetic:Dpdbl address.street.phonetic:apdbl address.street.phonetic:updbl address.street.phonetic:xDpdbl address.street.phonetic:xapdbl address.street.phonetic:xupdbl))^2.0
        6,7%       1.413.559ns ( 0,00s) BoostQuery (address.street.trigram:hau address.street.trigram:aup address.street.trigram:upt address.street.trigram:ptb address.street.trigram:tbu address.street.trigram:bue address.street.trigram:ueh address.street.trigram:ehl)^2.0
             12,6%         178.647ns ( 0,00s) TermQuery address.street.trigram:hau
             10,8%         152.014ns ( 0,00s) TermQuery address.street.trigram:aup
             10,6%         150.251ns ( 0,00s) TermQuery address.street.trigram:upt
              6,1%          86.559ns ( 0,00s) TermQuery address.street.trigram:ptb
              8,3%         116.926ns ( 0,00s) TermQuery address.street.trigram:tbu
             10,0%         141.978ns ( 0,00s) TermQuery address.street.trigram:bue
              9,9%         139.370ns ( 0,00s) TermQuery address.street.trigram:ueh
             10,2%         144.440ns ( 0,00s) TermQuery address.street.trigram:ehl

(Doug Turnbull) #2

One guess might be that the term dictionary for last names is larger as last names are more unique than first names. Larger term dictionaries lead to slower search. You could validate that by trying to compute the cardinality of each field.


(Mario Koehler) #3

I'm pretty new to elasticsearch, so i wanna be sure i understand you correctly.

I assume with "term dictionary" you mean the number of entries in the inverted index of the field? as far as i know there is no way to list them, or is there?

And with cardinality you mean the number of docId's (on average) associated with each term in the dictionary? So a field with fewer (on average) docId's associated with each term in the dictonary would have a higher cardinality than a field with more docId's, right?


(Mario Koehler) #4

In the meantime i had the opportunity to analyze how many trigrams are generated during indexing for each of the fields. and the analysis looks like this:

fullNameTrigrams  = 31923;    131.949ns -   179.616ns
firstNameTrigrams = 22863;    208.705ns -   244.919ns
lastNameTrigrams  = 28831;  1.233.435ns - 1.506.565ns
streetTrigrams    = 10915;     86.559ns -   178.647ns
cityTrigrams      = 4904;     137.265ns -   196.403ns

you can see on the left the number of unique trigrams generated for each field (this is for ca. 15 million records, which is our largest index). and on the left you can see how long trigram searches on that field typically take. what is striking - and i believe disproves the theory that the dictionary size is the culprit - is that the field with the smalles amount of trigrams (city) and the field with the largest amount of trigrams (fullName) basically have the same search times. But the lastName - which has fewer trigrams than fullName - takes ca. 10x as much time to do a search.

so bottom line is, i'm still looking for an explanation :slight_smile:


(Alan Woodward) #5

I'd expect TermQuery search times to be more or less linear in the size of the postings list for a given term, ie how many documents actually contain that term. So maybe the trigrams in your lastName field just hit many more documents? This is where Lucene 8/Elasticsearch 7 really speed things up, by allowing efficient skipping of blocks of documents that we know match, but won't score highly enough to get into the top-n hits.


(Mario Koehler) #6

Hallo Alan,

thank you for your reply. Just minutes ago I stumbled upon something peculiar that i believe contradicts your expectation that the search time should be linearly proportional to the number of documents that contain the term.

I tried the same query (same number/type of subqueries) but with different search terms (different name, different address) and i noticed this:

19,8% 6.600.572ns ( 0,01s) BooleanQuery person.firstName.trigram:doe person.firstName.trigram:oer person.firstName.trigram:ert person.firstName.trigram:rte
8,8% 582.310ns ( 0,00s) TermQuery person.firstName.trigram:doe
30,2% 1.992.752ns ( 0,00s) TermQuery person.firstName.trigram:oer
36,7% 2.423.682ns ( 0,00s) TermQuery person.firstName.trigram:ert
14,3% 945.375ns ( 0,00s) TermQuery person.firstName.trigram:rte
5,4% 1.792.049ns ( 0,00s) BoostQuery (person.lastName.phonetic:kYlir)^2.0
4,2% 1.389.988ns ( 0,00s) BoostQuery (person.lastName.trigram:koe person.lastName.trigram:oeh person.lastName.trigram:ehl person.lastName.trigram:hle person.lastName.trigram:ler)^2.0
16,4% 228.015ns ( 0,00s) TermQuery person.lastName.trigram:koe
14,6% 203.321ns ( 0,00s) TermQuery person.lastName.trigram:oeh
15,5% 215.691ns ( 0,00s) TermQuery person.lastName.trigram:ehl
14,5% 201.286ns ( 0,00s) TermQuery person.lastName.trigram:hle
16,0% 222.657ns ( 0,00s) TermQuery person.lastName.trigram:ler

You can see that in this case all the lastName trigram TermQueries are much faster and more in line with the overall performance level. Funny enough the trigram "koe" that is part of the lastName in both queries is fast in this example (~200.000ns) and slow in the previous example (~1.500.000ns).

since there should be the same amount of documents associated with that particular trigram ("koe") in both cases (the two queries are literally executed one after the other with no changes to the index), the number of documents associated shouldn't be a factor, right?

notice also, that in the second example suddenly two of the trigrams in firstName are slow too (in the ~2.000.000ns range). so it seems the effect is not limited to the lastName at all.

I have to admit, i'm very confused by now.


(Alan Woodward) #7

I've just spotted that you're talking nanosecond timings here, in which case judging what's happening from single runs is almost impossible, as there are any number of reasons why a particular bit of code might run slowly - OS-level IO, context switching, GC pauses, etc. You can benchmark things more thoroughly using something like https://github.com/elastic/rally, but to be honest at the timings and speeds that you're seeing I'm not sure it's worth it :slight_smile:


(Mario Koehler) #8

Is my interpretation of your last sentence right, that you feel it's already as fast as one could reasonably expect it to be for a query of this type?


(Mario Koehler) #9

I wasn't in the office yesterday, so i had some more time to think about your remarks. I get what you say about single runs not being a good measure. but the scenario here is actually a bit different, and i probably haven't communicated it well enough.

I have not just executed this once. I can reliably reproduce this behaviour as often as i want to and the values always show the same relationship (that ~10x factor). the absolute values might go up or down a couple of 10.000 ns, but the relationship always stays the same.

also, whenever i do this test i execute the profiling query a couple of times in a row until the overall execution time has stabilized. So the system is "warmed up".

and maybe also to explain a bit more how i came to take a look at the trigram query in the first place. i didn't look at the nanoseconds and thought "well thats a bit slow"...my experience with ES is far too limited to have a feel for what is "as fast as it can be". No, i looked at the overall execution time of the query and noticed that around 25% of it was "lost" in the trigram search. So i thought that might be a good place to start looking for optimizations. Then when i noticed those discrepancies in the excution times for different trigrams, that i had no logical explanation for i created this thread in the hopes to at least understand what is happening. Realizing that the answer may very well be "works as intended, move on :-)", but so far IMHO nobody really had a compelling explanation.


(David Turner) #10

I tried, and failed, to reproduce the effect described here in Elasticsearch 7.0.0:

DELETE /i

PUT /i
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "trigram_analyzer": {
          "tokenizer": "trigram_tokenizer"
        }
      },
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "first": {
        "type": "text", "analyzer": "trigram_analyzer"
      },
      "last": {
        "type": "text", "analyzer": "trigram_analyzer"
      },
      "full": {
        "type": "text", "analyzer": "trigram_analyzer"
      }
    }
  }
}

POST /i/_bulk?refresh
{"index":{}}
{"first":"BRADLEY","last":"LERNER","full":"BRADLEY LERNER"}

I then indexed 15,000,000 other names generated according to the frequencies of this python package.

Finally I ran this:

POST /i/_forcemerge?max_num_segments=1

POST /i/_refresh

GET /i/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "first": "BRADLEY"
          }
        },
        {
          "match": {
            "last": "LERNER"
          }
        },
        {
          "match": {
            "full": "BRADLEY LERNER"
          }
        }
      ]
    }
  },
  "profile": "true"
}

The profile was consistently something like this:

      "type": "BooleanQuery",
      "description": "first:BRA first:RAD first:ADL first:DLE first:LEY",
      "time_in_nanos": 51380312,
...
      "type": "BooleanQuery",
      "description": "last:LER last:ERN last:RNE last:NER",
      "time_in_nanos": 102950335,
...
      "type": "BooleanQuery",
      "description": "full:BRA full:RAD full:ADL full:DLE full:LEY full:LER full:ERN full:RNE full:NER",
      "time_in_nanos": 258977895,

I.e. 51ms on the first field, 102ms on the last field, and 258ms on the full field. The distribution I used has many more last names than first names, so I think this makes sense.

I suspect there's something odd about the distribution of data in your lastName field.


(Mario Koehler) #11

Hello David,

first of all thank you for taking the time to analyze this. I didn't mention it, but i'm still on ES 6.5, since that is the most recent version supported by AWS Cloud right now. So maybe it's something that is fixed in ES7.

Good thing is that i learned today that we (Otto GmbH in Germany) actually have an Enterprise support contract with Elastic.co. It was brokered by a different department (and our company is quite large), so i didn't know about it until today.

Part of that support contract is also a license for the "real" (or non-AWS version or whatever you want to call it) and our sister department will provision us a small cluster next week (after easter hollidays) for us to test our application on a newer version of ES, where we will also have access to stuff like the graphical profiler of the XPack.

I also learned today that on Wednesday next week our sister department will actually have a workshop of some kind with folks from Elastic.co on our premise and i was invited to take part. So that will probably be an opportunity for me to discuss this topic further with one of your colleages in person.

all in all a lot of stuff to test and look forward to, which will hopefully shed some light on my questions/problems.

nonetheless, thanks again for your effort! and if i find out something relevant I'll update this thread with my findings.

best regards
Mario Köhler


(David Turner) #12

I wouldn't expect many changes in this area between 6.5 and 7.0, so I tried the same reproduction on 6.5.4. I'm only using my laptop and not controlling the environment very carefully so the absolute numbers are different, but the qualitative pattern is similar here too:

      "description": "first:BRA first:RAD first:ADL first:DLE first:LEY",
      "time_in_nanos": 38436533,
...
      "description": "last:LER last:ERN last:RNE last:NER",
      "time_in_nanos": 80046458,
...
      "description": "full:BRA full:RAD full:ADL full:DLE full:LEY full:LER full:ERN full:RNE full:NER",
      "time_in_nanos": 197419846,

Aha! That should help a lot. Make sure to mention this thread to your support engineers, in case it helps them to compare notes with me or Alan. The answer might well lie in the details of your specific data set, but a support contract makes it much easier to dig into this without so many concerns (GDPR etc.).