Not able to match on short word


(Martin Dessureault) #1

My application relies on elasticsearch to search for people using different attributes. The user enters someone's first name, last name or both and the ES client does a boolean query across a handful of fields to find matching terms.

My problem is that, the way I have it set up, it cannot find people with short names like "Mo Jo". Instead, ES returns "Mo Johnson" even when there is a perfect match for "Mo Jo". It is almost as if short strings are not indexed at all.

Here is my template:

{
  "settings": {
    "number_of_shards": 10,
    "analysis": {
      "analyzer": {
        "exact": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        },
        "startswith": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "startswith_filter"
          ]
        }
      },
      "filter": {
        "startswith_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "people": {
      "properties": {
        "display_id": {
          "type": "string",
          "fields": {
            "display_id": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "email": {
          "type": "string",
          "fields": {
            "email": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "first_name": {
          "type": "string",
          "fields": {
            "first_name": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "last_name": {
          "type": "string",
          "fields": {
            "last_name": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "full_name": {
          "type": "string",
          "analyzer": "exact",
          "doc_values": true
        },
        "phone": {
          "type": "string",
          "index": "no",
          "doc_values": true
        },
        "external_id": {
          "type": "string",
          "index": "no",
          "doc_values": true
        }
      }
    }
  }
}

Here is the query, where 'q' contains what the user entered.

{
  "query": {
    "bool": {
      "should": [
        { "match": { "first_name": { "query": q }}},
        { "match": { "first_name._startswith": { "query": q }}},
        { "match": { "last_name": { "query": q }}},
        { "match": { "last_name._startswith": { "query": q }}},
        { "match": { "full_name": { "query": q }}},
        { "match": { "email": { "query": q }}},
        { "match": { "email._startswith": { "query": q }}},
        { "match": { "display_id": { "query": q }}},
        { "match": { "display_id._startswith": { "query": q }}}
      ],
      "minimum_should_match": 1
    }
  }
}

I have played quite a bit with different analyzers and filters, but I can never find a solution that allows searching for short names.

Any pointers greatly appreciated.


(Adrien Grand) #2

Can you run the explain API on a document that is supposed to match but does not? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html


(Martin Dessureault) #3

The response is too large to be pasted here so here is a link to it:
https://0bin.net/paste/cSsUYdESnQLmexBB#3GZvIMIfELcDiUFPbhG2g0hkzIdts+FDVqjo0yyoCC9


(Martin Dessureault) #4

Here are the top 10 results - running the same query without _explain:

{
  "took" : 116,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 701740,
    "max_score" : 9.206999,
    "hits" : [ {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "744125",
      "_score" : 9.206999,
      "_source":{"_id":"744125","raw_person_id":"113060412","full_name":"Heather Heath","first_name":"Heather","last_name":"Heath","email":"heather.heath@example.com","phone":"","external_id":"744125"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "982642",
      "_score" : 8.587239,
      "_source":{"_id":"982642","raw_person_id":"141060211","full_name":"Heather Jones","first_name":"Heather","last_name":"Jones","email":"heather.jones@example.com","phone":"","external_id":"982642"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1005621",
      "_score" : 8.587239,
      "_source":{"_id":"1005621","raw_person_id":"143620713","full_name":"Heather Jones","first_name":"Heather","last_name":"Jones","email":"heather.jones@example.com","phone":"","external_id":"1005621"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1132129",
      "_score" : 8.570217,
      "_source":{"_id":"1132129","raw_person_id":"157601433","full_name":"Heather Johanson","first_name":"Heather","last_name":"Johanson","email":"heather.johanson@example.com","phone":"","external_id":"1132129"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "841609",
      "_score" : 8.56539,
      "_source":{"_id":"841609","raw_person_id":"125833022","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"841609"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "636921",
      "_score" : 8.555441,
      "_source":{"_id":"636921","raw_person_id":"089574131","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"636921"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1137812",
      "_score" : 8.555441,
      "_source":{"_id":"1137812","raw_person_id":"158201865","full_name":"Heather Jones","first_name":"Heather","last_name":"Jones","email":"heather.jones@example.com","phone":"","external_id":"1137812"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1254778",
      "_score" : 8.5469055,
      "_source":{"_id":"1254778","raw_person_id":"170870344","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"1254778"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1018989",
      "_score" : 8.5469055,
      "_source":{"_id":"1018989","raw_person_id":"145178266","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"1018989"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "173812",
      "_score" : 8.5469055,
      "_source":{"_id":"173812","raw_person_id":"017780934","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"173812"}
    } ]
  }
}

(Peter van der Weerd) #5

This is basically a big OR query, where the individual sub queries are incomparable due to big differences in frequencies. Look at your explain, and note how wildly the scores of the subs vary.

This will cause some sub queries to completely overrule the others.

Can you try to wrap the subs as constant scores? So map them all to 1?
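
To make the idea concrete, here is a rough sketch of the difference between summing raw clause scores and mapping every matching clause to 1. The per-clause scores and document names are invented for illustration (they are not taken from the explain output), and the sketch ignores coord and query normalization.

```python
# Hypothetical per-clause scores for two documents (made-up numbers).
raw_scores = {
    "doc_a": {"first_name": 8.5, "last_name._startswith": 0.4},
    "doc_b": {"last_name": 0.9, "last_name._startswith": 0.3, "full_name": 1.1},
}


def summed_score(clause_scores):
    """Plain bool/should behaviour, simplified: add up the scores of the matching clauses."""
    return sum(clause_scores.values())


def constant_score(clause_scores):
    """Every matching clause is mapped to 1, so the score is the number of matching clauses."""
    return float(len(clause_scores))


for doc, clauses in raw_scores.items():
    print(doc, "summed:", summed_score(clauses), "constant:", constant_score(clauses))
```

With the raw scores, one high-scoring clause lets doc_a win even though doc_b matches more clauses; with constant scores, the document that matches the most clauses ranks first.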

/P


(Martin Dessureault) #6

Not so sure on how to make constant_score work; I get 0 results now :frowning:

Here is the query now:

{
 "query": {
  "constant_score": {
   "filter": [
    { "term": { "first_name": "Heather Jo" }},
    { "term": { "first_name._startswith": "Heather Jo" }},
    { "term": { "last_name": "Heather Jo" }},
    { "term": { "last_name._startswith": "Heather Jo" }},
    { "term": { "full_name": "Heather Jo" }},
    { "term": { "email": "Heather Jo" }},
    { "term": { "email._startswith": "Heather Jo" }},
    { "term": { "display_id": "Heather Jo" }},
    { "term": { "display_id._startswith": "Heather Jo" }}
   ]
  }
 }
}

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

(Peter van der Weerd) #7

I was more thinking of:

{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "filter": { "match": { "first_name": "Heather Jo" }}}},
        { "constant_score": { "filter": { "match": { "first_name._startswith": "Heather Jo" }}}},
        etc...
      ],
      "minimum_should_match": 1
    }
  }
}

Note that the 2nd clause implicitly includes the first clause, meaning that you are double-scoring exact name matches. That might or might not be what you want...

The difference between a match query and a term query is that a match query runs its input through the analyzer, while a term query does not. Don't get confused by the filter here. In the latest versions of ES, filters and queries are interchangeable.
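
A rough sketch of why the term version returns nothing, assuming a simplified stand-in for the "exact" analyzer (split on non-word characters, then lowercase). The helper names are hypothetical, the real standard tokenizer is more elaborate, and the indexed value is just an example of a full_name.

```python
import re


def exact_analyzer(text):
    """Simplified stand-in for the 'exact' analyzer: word tokens, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Terms stored in the index for a hypothetical full_name value "Heather Jo".
indexed_terms = set(exact_analyzer("Heather Jo"))


def term_query(terms, value):
    """A term query looks up the value as-is, without analysis."""
    return value in terms


def match_query(terms, value):
    """A match query analyzes the input first; by default any token may match (OR)."""
    return any(token in terms for token in exact_analyzer(value))


print(sorted(indexed_terms))                    # ['heather', 'jo']
print(term_query(indexed_terms, "Heather Jo"))  # False
print(match_query(indexed_terms, "Heather Jo")) # True
```

The index only ever contains the lowercased single tokens, so the unanalyzed string "Heather Jo" can never equal one of them.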


(Martin Dessureault) #8

Thanks for the reply, Peter. However, what you suggested resulted in the error "No filter registered for [match]];"

I then tried changing the query to use term instead of match and got 0 results as before. :frowning:

As to your comment about double scoring exact match - I do want to give preference to perfect name matching over partial matches so I guess this turns out to be ok.


(Peter van der Weerd) #9

Ah. What version of ES are you using?
Try changing the 'filter' hash in the constant_score into 'query'. But keep using the match instead of term.


(Martin Dessureault) #10

I'm using version 1.5.0 - old but I can't upgrade at the moment.

I did try what you suggested and it returns results that seem further away from the way I was querying originally. Unfortunately - no closer to a solution.


(Peter van der Weerd) #11

Hmm...
Can you send me:

  • The exact query,
  • the record that you consider to be the good answer,
  • the record that is now on top, and
  • the explains for both records?

(Martin Dessureault) #12

The query:

{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "query": { "match": { "first_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "first_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "last_name": { "query": "Heather Jo"}}}}},
        { "constant_score": { "query": { "match": { "last_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "full_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id._startswith": { "query": "Heather Jo" }}}}}
      ]
    }
  }
}

The good record:

{
  "_id": "215027",
  "raw_person_id": "021985191",
  "full_name": "Heather Jo",
  "first_name": "Heather",
  "last_name": "Jo",
  "email": "",
  "phone": "",
  "external_id": "215027",
  "display_id": "021985191"
}

The top record:

{
  "_id": "1039035",
  "raw_person_id": "147490243",
  "full_name": "Joe Hernandez",
  "first_name": "Joe",
  "last_name": "Hernandez",
  "email": "joe.hernandez@example.com",
  "phone": "",
  "external_id": "1039035",
  "display_id": "147490243"
 }

Explain good record: http://0bin.net/paste/kQNLw6JuKwWlTLbH#2PzEXXdGCWSypIdMYjYFnSXSg1THFy9sMMrGy6QJsRu
Explain top record: http://0bin.net/paste/8fdAOvmYeYS6XCTu#K4erFHgorqJstIdCHZnegDJdRFb9eIKb7POGCuP27G8


(Martin Dessureault) #13

Looking at the explain documents, why would ES try to match he, hea, heat, heath, heathe, heather and then start again with ae, aet, aeth, aethe, aether, and so on. I thought that the edgeNgram decomposed the token from the beginning only.
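
For reference, a minimal sketch of what I would expect an edge n-gram filter to emit compared with a plain n-gram filter. The min_gram and max_gram defaults come from the startswith_filter in the template above; both functions are simplified illustrations, not the actual Lucene implementations.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of the token only, from min_gram up to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def ngrams(token, min_gram=1, max_gram=20):
    """Substrings starting at every position, not just at the beginning."""
    grams = []
    for start in range(len(token)):
        longest = min(max_gram, len(token) - start)
        for n in range(min_gram, longest + 1):
            grams.append(token[start:start + n])
    return grams


print(edge_ngrams("heather"))  # ['h', 'he', 'hea', 'heat', 'heath', 'heathe', 'heather']
print(edge_ngrams("jo"))       # ['j', 'jo']
print(ngrams("jo"))            # ['j', 'jo', 'o']
```

Seeing grams that do not start at the first character of the token would point to something other than the edge n-gram filter being applied.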


(Peter van der Weerd) #14

Right, that's why there is explain :slightly_smiling:
Yeah. I don't get that, and I don't have much time right now. Also, at query time the exact analyzer should be used...
So 2 errors at least.
Use the _analyze api to check the workings of your analyzer.
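
A small sketch of building such a request, assuming the query-string form of the _analyze endpoint (analyzer and text parameters) that the older releases accept; please check the parameter names against the docs for your version. The index name is the one from the search response above, the host is an assumption, and the helper itself is hypothetical. It only builds the URL and sends nothing.

```python
from urllib.parse import urlencode


def analyze_url(index, analyzer, text, host="http://localhost:9200"):
    """Build a GET URL for the _analyze endpoint of one index (host is an assumption)."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{host}/{index}/_analyze?{query}"


# Compare what the index-time and the search-time analyzers do with the same input.
print(analyze_url("people_index", "startswith", "Heather Jo"))
print(analyze_url("people_index", "exact", "Heather Jo"))
```

Fetching both URLs (with curl, for example) and comparing the token lists shows whether the search side is producing single tokens or grams.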

And then I see the coord-crap in the explain.
Can you include disable_coord: true in the bool query and retry?


(Martin Dessureault) #15

Sorry for the long delay - I'm back at this.
I played with the analyze api and my analyzers are working as designed (as far as I can tell).

I also ran the same query, this time specifying the disable_coord property and, even though I got different results, they are nowhere close to the results I'm expecting. If you want I can post the new results here but they don't seem to help (at least for me).

I was also able to set up a new server on my local machine running ES 2.1.1, with the same results.

I tried to play with boost so I can boost an exact match but boosting doesn't seem to help. It seems like everything I try doesn't want to bring those short names back.

Help please :slight_smile:


(Peter van der Weerd) #16

So, I created an index with the settings you supplied, containing the 2 records that you supplied earlier.
Doing the query

curl -XPOST 'http://localhost:9200/martin/_search' -d '
{
  "explain": true,
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "constant_score": { "query": { "match": { "first_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "first_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "last_name": { "query": "Heather Jo"}}}}},
        { "constant_score": { "query": { "match": { "last_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "full_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id._startswith": { "query": "Heather Jo" }}}}}
      ]
    }
  }
}'

This returns the 2 records, with 'Heather Jo' scored as 1.67 and Joe Hernandez scored as 0.67, which makes total sense to me.
In the explain I don't see heather scored by he, hea, heat, etc.!!!!

An even better query: searching for heather jo means, IMHO, that heather should not be completed, while jo should.
This leads to the following query:

{
  "explain": true,
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "constant_score": { "query": { "match": { "first_name": { "query": "Heather" }}}}},
        { "constant_score": { "query": { "match": { "first_name._startswith": { "query": "Jo" }}}}},
        { "constant_score": { "query": { "match": { "last_name": { "query": "Heather"}}}}},
        { "constant_score": { "query": { "match": { "last_name._startswith": { "query": "Jo" }}}}},
        { "constant_score": { "query": { "match": { "email": { "query": "Heather" }}}}},
        { "constant_score": { "query": { "match": { "email._startswith": { "query": "Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id": { "query": "Heather" }}}}},
        { "constant_score": { "query": { "match": { "display_id._startswith": { "query": "Jo" }}}}}
      ]
    }
  }
}

If I execute that query, I get only 1 result: Heather Jo.
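
If the client has to build this from whatever the user typed, a sketch could look like the following. It assumes the last word is the one still being typed and everything before it is complete; that split rule, the function names and the single-word behaviour are my assumptions, and the constant_score syntax is the one used in this thread for the 1.x/2.x line.

```python
import json

# Fields that have an exact version and a "_startswith" sub-field in the mapping.
FIELDS = ["first_name", "last_name", "email", "display_id"]


def constant_match(field, text):
    """One should-clause: a match query wrapped in constant_score."""
    return {"constant_score": {"query": {"match": {field: {"query": text}}}}}


def build_query(user_input):
    """Complete words go to the exact fields, the last word to the _startswith fields."""
    tokens = user_input.split()
    if not tokens:
        raise ValueError("empty search input")
    *complete, prefix = tokens
    should = []
    for field in FIELDS:
        if complete:
            should.append(constant_match(field, " ".join(complete)))
        should.append(constant_match(field + "._startswith", prefix))
    return {"query": {"bool": {"disable_coord": True, "should": should}}}


print(json.dumps(build_query("Heather Jo"), indent=2))
```

For the input "Heather Jo" this produces the same eight clauses as the query above; for a single word it only queries the _startswith fields.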

But before you switch to this approach, you have to make sure that you know what the error in your original query is! In the explain that you sent me a few weeks ago, the explain showed that you were searching for edges. That is an indication that either your mapping didn't accept the search_analyzer, or that the analyzer is simply wrong.

Note that with what you supplied it works at my place!

/P


(Martin Dessureault) #17

well well - I found the problem and I should be ashamed :frowning:

As you suspected, the mapping the index was actually using is not the mapping I thought it was. We put mappings in templates that are applied, in order, to new indices. What I posted here is the last mapping, not the end result of applying all templates. It turns out there is no way for one template to remove a setting defined by an earlier one: you can change its value, but you cannot remove the attribute (field) altogether. This left an analyzer (configured as an ngram) as the default search analyzer, which gave the results we were seeing. After I added another template to make sure that the result of applying all templates was the mapping I wanted, voila, things started to work just fine.
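
For anyone hitting the same thing, here is a simplified sketch of the effect. The two template fragments and the analyzer name "ngram_analyzer" are hypothetical, and the merge function only approximates how ordered templates are combined: later values override earlier ones, but a key that the later template does not mention survives.

```python
import copy


def deep_merge(base, override):
    """Later values win, but keys missing from the override are kept from the base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical earlier template: sets a search analyzer on the field.
early_template = {"first_name": {"type": "string", "search_analyzer": "ngram_analyzer"}}
# Hypothetical later template: what I believed the final mapping to be.
late_template = {"first_name": {"type": "string", "analyzer": "exact"}}

effective = deep_merge(early_template, late_template)
print(effective)
```

The leftover search_analyzer is still present in the effective mapping, so the only way to get rid of it is to override it explicitly with the value you want.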

Thanks Peter for your help and sorry to have wasted your time with a... user error...

