Huge difference in ngram scoring after upgrading to 5.0


(Kjartan Bjørset) #1

I am in the process of upgrading from 2.4 to 5.0 and have seen my query results change a lot given the same query and the same indexing routines.

In short, I have traced the culprit to the scoring of my trigrams, which has changed significantly between Elasticsearch 2.4 and 5.0. With "explain": true, the output looks quite different. In 2.4 each trigram term is scored separately and the per-term scores are summed. In 5.0 this is no longer the case: a "Synonym" clause has suddenly appeared, as shown below:

2.4 output:

"_explanation": {
  "value": 6.958956,
  "description": "sum of:",
  "details": [
    {
      "value": 0.25626373,
      "description": "sum of:",
      "details": [
        {
          "value": 0.25626373,
          "description": "weight(identifying.trigram:vei in 79406) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.25626373,
              "description": "score(doc=79406,freq=1.0), product of:",
              "details": [

           ....

5.0 output:

{
  "value": 5.197066,
  "description": "sum of:",
  "details": [
    {
      "value": 0.46604955,
      "description": "weight(Synonym(identifying.trigram:eit identifying.trigram:ita identifying.trigram:vei) in 65352) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 0.46604955,
          "description": "score(doc=65352,freq=1.0), product of:",
          "details": [
    
       ....

I've only included the interesting parts of the output here, and the particularly important part seems to be this:

weight(**Synonym(**identifying.trigram:eit identifying.trigram:ita identifying.trigram:vei)
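To make the difference between the two explain shapes concrete, here is a toy sketch in Python. It is not Elasticsearch's actual scoring code: the tf and idf formulas are simplified, the frequencies and document counts are invented, and the grouping rule (sum the term frequencies, use the largest document frequency) is an assumption about how a grouped clause could be scored once instead of once per trigram.

```python
import math


def toy_idf(doc_freq, num_docs):
    # Simplified inverse document frequency (illustrative only).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def toy_tf(freq):
    # Simplified term-frequency factor (illustrative only).
    return math.sqrt(freq)


def score_separately(terms, num_docs):
    # 2.4-style shape: one score per trigram, then "sum of:".
    return sum(toy_tf(freq) * toy_idf(df, num_docs) for freq, df in terms.values())


def score_as_one_group(terms, num_docs):
    # 5.0-style shape (assumed): the trigrams are scored once, as a single unit.
    total_freq = sum(freq for freq, _ in terms.values())
    max_df = max(df for _, df in terms.values())
    return toy_tf(total_freq) * toy_idf(max_df, num_docs)


# Hypothetical statistics: trigram -> (frequency in the document, document frequency).
TERMS = {"vei": (1, 100), "eit": (1, 50), "ita": (1, 200)}
NUM_DOCS = 1000

separate = score_separately(TERMS, NUM_DOCS)
grouped = score_as_one_group(TERMS, NUM_DOCS)
print(f"separate={separate:.3f} grouped={grouped:.3f}")
```

With these made-up numbers the grouped score comes out lower than the summed one, which would be one plausible reason for rankings to shift even when the similarity itself is unchanged.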

I do not understand where this "Synonym" comes from. I have never configured synonyms in my index settings, and the only thing I can find in the documentation is the page on synonym token filters (https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-synonym-tokenfilter.html). The Synonym clause appears no matter what I do at index creation time.

I have already turned off the new "BM25" similarity (using TF/IDF, as in ES 2.4), but that does not change anything. This really messes with my search results, and I'm struggling to keep the same quality of service in 5.0.

So the question is: is this a bug, a new feature I cannot switch off (e.g. in the ngram filter implementation), or is there something else I am missing here?

Here are the analyzer and token filter for the identifying field:

Analyzer:

"identifying_trigram_analyzer" :  {
    "type": "custom",
    "tokenizer": "standard",
    "filter" : [
        "lowercase",
        "asciifolding",
        "trigram_filter"
        ]
}

Trigram filter:

"trigram_filter": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3
}
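
For reference, a rough Python approximation of what this analyzer chain produces for a single word. This is only a sketch: whitespace splitting stands in for the standard tokenizer, Unicode decomposition stands in for asciifolding (it does not cover every character the real filter maps), and recording every trigram at the position of its source word is an assumption about the ngram filter's behaviour. The input "Veita" is a hypothetical example chosen because it yields the three trigrams seen in the explain output.

```python
import unicodedata


def fold_ascii(text):
    # Crude stand-in for asciifolding: strip combining accents after decomposition.
    # Characters without a decomposition (for example "ø") are left unchanged here.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigrams(token, n=3):
    # min_gram = max_gram = 3: every contiguous 3-character slice of the token.
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def analyze(text):
    # Returns (position, gram) pairs; all grams of one word share that word's position.
    out = []
    for position, word in enumerate(text.split()):
        token = fold_ascii(word.lower())
        for gram in trigrams(token):
            out.append((position, gram))
    return out


print(analyze("Veita"))
```

Here all three trigrams (vei, eit, ita) end up at position 0, and a word shorter than three characters produces no tokens at all.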

Any help would be greatly appreciated.

Kind regards,
desperate developer

