Elasticsearch: Default Similarity Algorithm and BM25 giving same results


(Rahul Nama) #1

Hi Team

I'm trying to understand which similarity algorithm is best suited for our use case:
1. TF-IDF (default)
2. BM25

I've created two indices, one mapped for each algorithm, but when I query them, both indices return the same documents with the same scores.

Mappings for Index 1 (default algorithm):

{
  "singleindex": {
"aliases": {},
"mappings": {
  "people": {
    "properties": {
      "Application-Name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "Author": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "Character Count": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "Content-Type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "Creation-Date": {
        "type": "date"
      },
      "Last-Modified": {
        "type": "date"
      },
      "Page-Count": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "Word-Count": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "meta:last-author": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "tika": {
        "properties": {
          "mime": {
            "properties": {
              "file": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      },
      "xmpTPg:NPages": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
},
"settings": {
  "index": {
    "creation_date": "1539597307747",
    "number_of_shards": "5",
    "number_of_replicas": "1",
    "uuid": "9jzKHct4T3qjI_EfFPFnzg",
    "version": {
      "created": "6020399"
    },
    "provided_name": "singleindex"
  }
}
  }
}

Mappings for Index 2 (BM25):

{
  "tes_index": {
    "aliases": {},
    "mappings": {
      "people": {
        "properties": {
          "Application-Name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Author": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "similarity": "BM25"
              }
            }
          },
          "Character Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Content-Type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Creation-Date": {
            "type": "date"
          },
          "Last-Modified": {
            "type": "date"
          },
          "Page-Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Word-Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:last-author": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "xmpTPg:NPages": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "tes_index",
        "similarity": {
          "default": {
            "type": "BM25"
          }
        },
        "creation_date": "1539599395035",
        "number_of_replicas": "1",
        "uuid": "BFDAL1XuQ1O36KLIMDT2mw",
        "version": {
          "created": "6020399"
        }
      }
    }
  }
}

You can see the BM25 similarity in the settings of the second index. Please suggest what I'm missing here.


(Rahul Nama) #2

Hello team

Can someone please help?


(David Pilato) #3

Read this and specifically the "Also be patient" part.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.


(Rahul Nama) #4

@dadoonet
Thank you. I should have read this earlier.

Will follow the guidelines strictly.


(Abdon Pijpelink) #5

BM25 is already the default similarity. It has been since version 5.0. Before that, it used to be TF/IDF. If you want to use TF/IDF, you would have to configure an index to use the classic similarity.

It may not be worth spending a lot of time testing TF/IDF though, as it has been deprecated.
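For a side-by-side test on 6.x, the comparison index would need the classic similarity rather than BM25. Here is a minimal sketch of such a create-index body, built in Python so it can be printed as JSON. The index name `tfidf_index` is hypothetical, and the sketch assumes a version where `classic` is still accepted:

```python
import json

# Hypothetical index name, used only for illustration.
INDEX_NAME = "tfidf_index"

# Index settings that make TF/IDF ("classic") the default similarity
# for the text fields of this index, instead of the built-in BM25 default.
create_index_body = {
    "settings": {
        "index": {
            "similarity": {
                "default": {"type": "classic"}
            }
        }
    }
}

# Print the request that would be sent to Elasticsearch.
print(f"PUT /{INDEX_NAME}")
print(json.dumps(create_index_body, indent=2))
```

The body would be sent when creating the index. The same documents then need to be indexed into it before scores can be compared against the BM25 index.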


(Rahul Nama) #6

Thanks @abdon
So Elasticsearch uses BM25 by default?


(Abdon Pijpelink) #7

Yes, since version 5.0


(Rahul Nama) #8

Okay, great.

Thanks for your time @abdon

Does ES support any algorithms that perform better than BM25 and that I should consider?


(Abdon Pijpelink) #9

"Better" is a very subjective term. There is a reason that Lucene and Elasticsearch use BM25 as the default: it seems to work well for a lot of use cases. Having said that, there are a number of other similarity algorithms available that may work better in certain use cases. You can find a list of all available similarities in the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
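As one example of what that documentation page covers, a similarity can be configured under a custom name and assigned to a single field. The sketch below assumes the 6.x mapping layout from the first post (type `people`, field `content`). The name `tuned_bm25` and the parameter values are illustrative assumptions, not recommendations:

```python
import json

# "tuned_bm25" is a hypothetical name for a custom similarity.
# k1 controls term-frequency saturation, b controls length normalization;
# the values here are placeholders for experimentation.
create_index_body = {
    "settings": {
        "index": {
            "similarity": {
                "tuned_bm25": {"type": "BM25", "k1": 1.2, "b": 0.5}
            }
        }
    },
    "mappings": {
        "people": {
            "properties": {
                # Only this field uses the custom similarity;
                # other text fields keep the index default.
                "content": {"type": "text", "similarity": "tuned_bm25"}
            }
        }
    },
}

print(json.dumps(create_index_body, indent=2))
```

Setting the similarity per field like this makes it possible to experiment on the long `content` field while leaving short metadata fields on the default.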

You may also find this video on our website interesting: https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25. It discusses the switch from TF/IDF to BM25.


(Rahul Nama) #10

@abdon Thanks for the reference links, they were helpful. What I observed is that BM25 is better suited for most use cases, though I didn't follow how it calculates the probability of a document being relevant from its formula. I need to spend more time on it and give it another try.

We have long text fields, more than 15 lines for each document. Is there an algorithm that is better suited to long text fields? If not, we're happy to go with BM25.

Thanks for your time as always :slight_smile:


(Abdon Pijpelink) #11

I would say that generally BM25 is great for longer fields. It is really designed for that full-text search use case.
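To make the length handling concrete, here is a rough sketch of the per-term BM25 score in the form commonly given for Lucene's implementation, with the usual defaults `k1 = 1.2` and `b = 0.75`. The function names are made up for illustration. Real Elasticsearch scores will differ somewhat, since field lengths are stored in a lossy encoding and statistics are computed per shard:

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(freq, field_len, avg_field_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document field."""
    # Length normalization: fields longer than average are penalized,
    # shorter ones are boosted; b sets how strong that effect is.
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    # Saturating term frequency: repeated occurrences add less and less,
    # and the factor never exceeds k1 + 1.
    tf = freq * (k1 + 1) / (freq + norm)
    return bm25_idf(doc_count, doc_freq) * tf

# Same term frequency, different field lengths (average length is 100).
short_doc = bm25_term_score(freq=2, field_len=50, avg_field_len=100,
                            doc_count=1000, doc_freq=10)
long_doc = bm25_term_score(freq=2, field_len=400, avg_field_len=100,
                           doc_count=1000, doc_freq=10)
print(short_doc > long_doc)
```

The saturation in the `tf` factor is what keeps a long document from winning simply by repeating a term many times.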

If you want to learn more about BM25, we have a great 3-part blog series on our website:


(Rahul Nama) #12

@abdon

Good to hear. Thank you so much.

I will go through the blogs. That should help.

Thanks again :slight_smile:

-- Rahul

