Not able to match on short word

mdessureault · March 8, 2016, 3:06pm

My application relies on elasticsearch to search for people using different attributes. The user enters someone's first name, last name or both and the ES client does a boolean query across a handful of fields to find matching terms.

My problem is that, the way I have it set up, it cannot find people with short names like "Mo Jo". Instead, ES returns "Mo Johnson" even when there is a perfect match for "Mo Jo". It is almost as if short strings are not indexed at all.

Here is my template:

{
  "settings": {
    "number_of_shards": 10,
    "analysis": {
      "analyzer": {
        "exact": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        },
        "startswith": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "startswith_filter"
          ]
        }
      },
      "filter": {
        "startswith_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "people": {
      "properties": {
        "display_id": {
          "type": "string",
          "fields": {
            "display_id": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "email": {
          "type": "string",
          "fields": {
            "email": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "first_name": {
          "type": "string",
          "fields": {
            "first_name": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "last_name": {
          "type": "string",
          "fields": {
            "last_name": {
              "type": "string",
              "analyzer": "exact"
            },
            "_startswith": {
              "type": "string",
              "analyzer": "startswith",
              "search_analyzer": "exact"
            }
          }
        },
        "full_name": {
          "type": "string",
          "analyzer": "exact",
          "doc_values": true
        },
        "phone": {
          "type": "string",
          "index": "no",
          "doc_values": true
        },
        "external_id": {
          "type": "string",
          "index": "no",
          "doc_values": true
        }
      }
    }
  }
}

Here is the query, where 'q' contains what the user entered.

{
  "query": {
    "bool": {
      "should": [
        { "match": { "first_name": { "query": q }}},
        { "match": { "first_name._startswith": { "query": q }}},
        { "match": { "last_name": { "query": q }}},
        { "match": { "last_name._startswith": { "query": q }}},
        { "match": { "full_name": { "query": q }}},
        { "match": { "email": { "query": q }}},
        { "match": { "email._startswith": { "query": q }}},
        { "match": { "display_id": { "query": q }}},
        { "match": { "display_id._startswith": { "query": q }}}
      ],
      "minimum_should_match": 1
    }
  }
}

I have played quite a bit with different analyzers and filters, but I can never find a solution that allows searching for short names.

Any pointers greatly appreciated.

jpountz · March 8, 2016, 5:33pm

Can you run the explain API on a document that is supposed to match but does not? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

mdessureault · March 8, 2016, 7:27pm

The response is too large to be pasted here so here is a link to it:
https://0bin.net/paste/cSsUYdESnQLmexBB#3GZvIMIfELcDiUFPbhG2g0hkzIdts+FDVqjo0yyoCC9

mdessureault · March 8, 2016, 7:40pm

Here are the top 10 results - running the same query without _explain:

{
  "took" : 116,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 701740,
    "max_score" : 9.206999,
    "hits" : [ {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "744125",
      "_score" : 9.206999,
      "_source":{"_id":"744125","raw_person_id":"113060412","full_name":"Heather Heath","first_name":"Heather","last_name":"Heath","email":"heather.heath@example.com","phone":"","external_id":"744125"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "982642",
      "_score" : 8.587239,
      "_source":{"_id":"982642","raw_person_id":"141060211","full_name":"Heather Jones","first_name":"Heather","last_name":"Jones","email":"heather.jones@example.com","phone":"","external_id":"982642"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1005621",
      "_score" : 8.587239,
      "_source":{"_id":"1005621","raw_person_id":"143620713","full_name":"Heather Jones","first_name":"Heather","last_name":"Jones","email":"heather.jones@example.com","phone":"","external_id":"1005621"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1132129",
      "_score" : 8.570217,
      "_source":{"_id":"1132129","raw_person_id":"157601433","full_name":"Heather Johanson","first_name":"Heather","last_name":"Johanson","email":"heather.johanson@example.com","phone":"","external_id":"1132129"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "841609",
      "_score" : 8.56539,
      "_source":{"_id":"841609","raw_person_id":"125833022","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"841609"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "636921",
      "_score" : 8.555441,
      "_source":{"_id":"636921","raw_person_id":"089574131","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"636921"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1137812",
      "_score" : 8.555441,
      "_source":{"_id":"1137812","raw_person_id":"158201865","full_name":"Heather Jones","first_name":"Heather","last_name":"Jones","email":"heather.jones@example.com","phone":"","external_id":"1137812"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1254778",
      "_score" : 8.5469055,
      "_source":{"_id":"1254778","raw_person_id":"170870344","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"1254778"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "1018989",
      "_score" : 8.5469055,
      "_source":{"_id":"1018989","raw_person_id":"145178266","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"1018989"}
    }, {
      "_index" : "people_index",
      "_type" : "people",
      "_id" : "173812",
      "_score" : 8.5469055,
      "_source":{"_id":"173812","raw_person_id":"017780934","full_name":"Heather Johnson","first_name":"Heather","last_name":"Johnson","email":"heather.johnson@example.com","phone":"","external_id":"173812"}
    } ]
  }
}

Peter_van_der_Weerd · March 8, 2016, 7:46pm

This is basically a big OR query, where the individual sub queries are incomparable due to big differences in frequencies. Look at your explain, and note how wildly the scores of the subs vary.

This will cause some sub queries to completely overrule the others.

Can you try to wrap the subs as constant scores? So map them all to 1?

/P

mdessureault · March 8, 2016, 8:34pm

Not so sure on how to make constant_score work; I get 0 results now

Here is the query now:

{
 "query": {
  "constant_score": {
   "filter": [
    { "term": { "first_name": "Heather Jo" }},
    { "term": { "first_name._startswith": "Heather Jo" }},
    { "term": { "last_name": "Heather Jo" }},
    { "term": { "last_name._startswith": "Heather Jo" }},
    { "term": { "full_name": "Heather Jo" }},
    { "term": { "email": "Heather Jo" }},
    { "term": { "email._startswith": "Heather Jo" }},
    { "term": { "display_id": "Heather Jo" }},
    { "term": { "display_id._startswith": "Heather Jo" }}
   ]
  }
 }
}

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Peter_van_der_Weerd · March 9, 2016, 8:07am

I was more thinking of:

{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "filter": { "match": { "first_name": "Heather Jo" }}}},
        { "constant_score": { "filter": { "match": { "first_name._startswith": "Heather Jo" }}}},
        etc...
      ],
      "minimum_should_match": 1
    }
  }
}

Note that the 2nd clause implicitly includes the first clause, meaning that you are double scoring exact name matches. That might or might not be what you want...

The difference between a match and a term-query is that a match query is using the analyzer while a term-query doesn't. Don't get confused by the filter here. In the latest versions of ES filters and queries are interchangeable.
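
To illustrate the match/term difference, here is a minimal Python sketch. It is not Elasticsearch, only a toy inverted index. `exact_analyze` is a hypothetical stand-in that roughly approximates the "exact" analyzer from the template (split on non-alphanumerics, then lowercase), and the document ids `doc1` and `doc2` are made up.

```python
import re


def exact_analyze(text):
    """Rough approximation of the "exact" analyzer: tokenize, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in exact_analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Like a term query: the value is looked up verbatim, with no analysis."""
    return index.get(value, set())


def match_query(index, text):
    """Like a match query: the input is analyzed first, then each token is looked up."""
    hits = set()
    for token in exact_analyze(text):
        hits |= index.get(token, set())
    return hits


# Hypothetical ids; the names are taken from this thread.
docs = {"doc1": "Heather Jo", "doc2": "Heather Johnson"}
index = build_index(docs)

print(term_query(index, "Heather Jo"))           # set()
print(sorted(match_query(index, "Heather Jo")))  # ['doc1', 'doc2']
```

The index only ever contains `heather`, `jo` and `johnson`, so a term lookup for the literal string `Heather Jo` cannot hit anything, which is consistent with the 0 results above.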

mdessureault · March 9, 2016, 2:58pm

Thanks for the reply, Peter. However, what you suggested resulted in the error "No filter registered for [match]];"

I then tried changing the query to use term instead of match and got 0 results, as before.

As to your comment about double scoring exact match - I do want to give preference to perfect name matching over partial matches so I guess this turns out to be ok.

Peter_van_der_Weerd · March 9, 2016, 3:30pm

Ah. What version of ES are you using?
Try changing the 'filter' hash in the constant_score into 'query'. But keep using the match instead of term.

mdessureault · March 9, 2016, 3:41pm

I'm using version 1.5.0 - old but I can't upgrade at the moment.

I did try what you suggested and it returns results that seem further away from the way I was querying originally. Unfortunately - no closer to a solution.

Peter_van_der_Weerd · March 10, 2016, 8:08am

Hmm...
Can you send me:

The exact query,
the record that you consider to be the good answer,
the record that is now on top, and
the explains for both records?

mdessureault · March 10, 2016, 8:52pm

The query:

{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "query": { "match": { "first_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "first_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "last_name": { "query": "Heather Jo"}}}}},
        { "constant_score": { "query": { "match": { "last_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "full_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id._startswith": { "query": "Heather Jo" }}}}}
      ]
    }
  }
}

The good record:

{
  "_id": "215027",
  "raw_person_id": "021985191",
  "full_name": "Heather Jo",
  "first_name": "Heather",
  "last_name": "Jo",
  "email": "",
  "phone": "",
  "external_id": "215027",
  "display_id": "021985191"
}

The top record:

{
  "_id": "1039035",
  "raw_person_id": "147490243",
  "full_name": "Joe Hernandez",
  "first_name": "Joe",
  "last_name": "Hernandez",
  "email": "joe.hernandez@example.com",
  "phone": "",
  "external_id": "1039035",
  "display_id": "147490243"
 }

Explain good record: http://0bin.net/paste/kQNLw6JuKwWlTLbH#2PzEXXdGCWSypIdMYjYFnSXSg1THFy9sMMrGy6QJsRu
Explain top record: http://0bin.net/paste/8fdAOvmYeYS6XCTu#K4erFHgorqJstIdCHZnegDJdRFb9eIKb7POGCuP27G8

mdessureault · March 10, 2016, 9:15pm

Looking at the explain documents, why would ES try to match he, hea, heat, heath, heathe, heather and then start again with ae, aet, aeth, aethe, aether, and so on. I thought that the edgeNgram decomposed the token from the beginning only.
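
For reference, a small Python sketch of how the two filter types differ in principle. These are simplified re-implementations, not the Lucene filters themselves, and the `min_gram`/`max_gram` values passed to `ngrams` are arbitrary assumptions. Grams that start in the middle of a token are what an ngram-style filter would produce; an edge n-gram filter only emits prefixes.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of the token only, as an edge n-gram filter would emit."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def ngrams(token, min_gram, max_gram):
    """Substrings starting at every offset, as a plain n-gram filter would emit."""
    grams = []
    for start in range(len(token)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(token):
                grams.append(token[start:start + n])
    return grams


print(edge_ngrams("heather"))
# ['h', 'he', 'hea', 'heat', 'heath', 'heathe', 'heather']

print(ngrams("jo", 1, 2))
# ['j', 'jo', 'o']
```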

Peter_van_der_Weerd · March 10, 2016, 10:03pm

Right, that's why there is explain.
Yeah, I don't get that either, and I don't have much time right now. Also, at query time the exact analyzer should be used...
So 2 errors at least.
Use the _analyze api to check the workings of your analyzer.

And then I see the coord-crap in the explain.
Can you include disable_coord: true in the bool query and retry?

mdessureault · March 14, 2016, 6:47pm

Sorry for the long delay - I'm back at this.
I played with the analyze api and my analyzers are working as designed (as far as I can tell).

I also ran the same query, this time specifying the disable_coord property and, even though I got different results, they are nowhere close to results I'm expecting. If you want I can post the new results here but they don't seem to help (at least for me).

I was also able to set up a new server on my local machine running ES 2.1.1, with the same results.

I tried to play with boost so I can boost an exact match but boosting doesn't seem to help. It seems like everything I try doesn't want to bring those short names back.

Help please

Peter_van_der_Weerd · March 20, 2016, 5:38pm

So, I created an index with the settings you supplied, containing the 2 records that you supplied earlier.
Doing the query

curl -XPOST 'http://localhost:9200/martin/_search' -d '
{
  "explain": true,
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "constant_score": { "query": { "match": { "first_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "first_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "last_name": { "query": "Heather Jo"}}}}},
        { "constant_score": { "query": { "match": { "last_name._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "full_name": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "email._startswith": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id": { "query": "Heather Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id._startswith": { "query": "Heather Jo" }}}}}
      ]
    }
  }
}'

This returns the 2 records, with 'Heather Jo' scored as 1.67 and Joe Hernandez scored as 0.67, which makes total sense to me.
In the explain I don't see heather scored by he, hea, heat, etc!!!!

An even better query: searching for heather jo means IMHO that heather should not be completed, while jo should.
This leads to the following query:

{
  "explain": true,
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "constant_score": { "query": { "match": { "first_name": { "query": "Heather" }}}}},
        { "constant_score": { "query": { "match": { "first_name._startswith": { "query": "Jo" }}}}},
        { "constant_score": { "query": { "match": { "last_name": { "query": "Heather"}}}}},
        { "constant_score": { "query": { "match": { "last_name._startswith": { "query": "Jo" }}}}},
        { "constant_score": { "query": { "match": { "email": { "query": "Heather" }}}}},
        { "constant_score": { "query": { "match": { "email._startswith": { "query": "Jo" }}}}},
        { "constant_score": { "query": { "match": { "display_id": { "query": "Heather" }}}}},
        { "constant_score": { "query": { "match": { "display_id._startswith": { "query": "Jo" }}}}}
      ]
    }
  }
}

If I execute that query, I get only 1 result: Heather Jo.
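
A client could build that kind of query from the raw input along these lines. This is only a sketch: it assumes the last whitespace-separated word is the one still being typed, the helper names `clause` and `build_query` are hypothetical, and the `constant_score`/`disable_coord` syntax is the ES 1.x/2.x form used in this thread.

```python
import json

# Fields that have a "_startswith" sub-field in the mapping from this thread.
FIELDS = ["first_name", "last_name", "email", "display_id"]


def clause(field, text):
    """One constant-score match clause, in the form used above."""
    return {"constant_score": {"query": {"match": {field: {"query": text}}}}}


def build_query(user_input):
    """Match completed words exactly and only the last word as a prefix."""
    words = user_input.split()
    if not words:
        raise ValueError("empty search input")
    complete, partial = " ".join(words[:-1]), words[-1]
    should = []
    for field in FIELDS:
        if complete:  # skipped when the user has typed a single word
            should.append(clause(field, complete))
        should.append(clause(field + "._startswith", partial))
    return {"query": {"bool": {"disable_coord": True, "should": should}}}


print(json.dumps(build_query("Heather Jo"), indent=2))
```

With a single word such as "Jo", only the four `_startswith` clauses are generated.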

But before you switch to this approach, you have to make sure you know what the error in your original query was! The explain that you sent me a few weeks ago showed that you were searching for edges. That is an indication that either your mapping didn't accept the search_analyzer, or that the analyzer is simply wrong.

Note that with what you supplied it works at my place!

/P

mdessureault · March 21, 2016, 1:25pm

well well - I found the problem and I should be ashamed

As you suspected, the mapping the index was created with is not the mapping I thought it was. We put mappings in templates that are applied, in order, to new indices. What I posted here is the last mapping, not the end result of applying all templates. It turns out there is no way for one template to remove a setting defined by another: you can change its value, but you cannot remove the attribute (field) altogether. This left an analyzer (configured as an ngram) as the default search analyzer, which gave the results we were seeing. After adding another template to make sure that the result of applying all templates was the mapping I wanted, voila, things started to work just fine.
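
A simplified Python model of that behaviour, for anyone hitting the same thing. It is a sketch under assumptions, not Elasticsearch's actual merge code: the template bodies, the `old_ngram_filter` name and the use of a `default_search` analyzer entry are hypothetical reconstructions, since the real earlier templates were not posted.

```python
import copy


def merge_templates(templates):
    """Apply templates lowest order first; later values override, keys are never removed."""

    def deep_merge(base, override):
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                deep_merge(base[key], value)
            else:
                base[key] = copy.deepcopy(value)
        return base

    result = {}
    for template in sorted(templates, key=lambda t: t["order"]):
        deep_merge(result, template["body"])
    return result


# Hypothetical earlier template that sets an ngram-based default search analyzer.
templates = [
    {"order": 0, "body": {"settings": {"analysis": {"analyzer": {
        "default_search": {"type": "custom", "tokenizer": "standard",
                           "filter": ["lowercase", "old_ngram_filter"]}}}}}},
    # Later template: adds "exact" but has no way to delete "default_search".
    {"order": 1, "body": {"settings": {"analysis": {"analyzer": {
        "exact": {"type": "custom", "tokenizer": "standard",
                  "filter": ["lowercase"]}}}}}},
]

analyzers = merge_templates(templates)["settings"]["analysis"]["analyzer"]
print(sorted(analyzers))  # ['default_search', 'exact']

# The leftover key can only be neutralised by overriding its value in another template.
override = {"order": 2, "body": {"settings": {"analysis": {"analyzer": {
    "default_search": {"type": "custom", "tokenizer": "standard",
                       "filter": ["lowercase"]}}}}}}
fixed = merge_templates(templates + [override])
print(fixed["settings"]["analysis"]["analyzer"]["default_search"]["filter"])  # ['lowercase']
```

Fetching the live index's `_settings` and `_mapping` shows the merged result, which is what should be checked rather than any single template.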

Thanks Peter for your help and sorry to have wasted your time with a... user error...
