I'm trying to build a process to merge a bunch of records. I have an index of music artists that is not thoroughly cleansed, and I'd like to build a process that loops over each artist and finds similarly spelled ones. The plan is to take these relationships and let a user review them and potentially confirm that, say, "beyonce" and "beyoncé" are the same artist (bad example).
I'm having trouble doing this using the _score value because of inverse document frequency. For example, if I search for "A midsummer nights dream" against the following documents:
A MIDSUMMER NIGHT'S DREAM
A MIDSUMMER NIGHTS DREAM
A MIDSUMMER NIGHT´S DREAM
A MIDSUMMER'S NIGHT'S DREAM
A NARRATED MIDSUMMER NIGHT'S DREAM
The "NARRATED" version appears higher than some of the other results due to the rarity of "narrated".
My query looks like this:
GET artists/artist/_search
{
  "query": {
    "match": {
      "name": {
        "query": "A MIDSUMMER NIGHTS DREAM",
        "fuzziness": 3
      }
    }
  }
}
I'd like to base the score on something like the number of tokens that match the input query. Is such a thing possible? I can't find much about it in the documentation.
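To make concrete what I mean by scoring on matched tokens, here is a rough sketch of the kind of request body I have in mind, not something I know to be the right approach. It assumes each query token can be wrapped in its own constant_score clause inside a bool/should, so that every matching token adds a fixed 1.0 no matter how rare it is. The helper name build_token_count_query is made up for illustration, the whitespace split is only an approximation of whatever analyzer the name field uses, the "AUTO" fuzziness is a placeholder, and putting a match query under constant_score's filter assumes an Elasticsearch version that accepts queries there (2.x or later, as far as I know).

```python
import json


def build_token_count_query(field, text, fuzziness="AUTO"):
    """Build a search body whose score counts matching query tokens.

    Hypothetical helper: each token becomes its own constant_score clause,
    so a matching token contributes a fixed boost of 1.0 rather than an
    IDF-weighted score.
    """
    # Assumption: a lowercase whitespace split approximates the field's analyzer.
    tokens = text.lower().split()
    should = [
        {
            "constant_score": {
                "filter": {
                    "match": {field: {"query": token, "fuzziness": fuzziness}}
                },
                "boost": 1.0,
            }
        }
        for token in tokens
    ]
    # Require at least one token to match so unrelated documents are excluded.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


body = build_token_count_query("name", "A MIDSUMMER NIGHTS DREAM")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of the same _search request as above. One limitation I can already see: this only counts how many query tokens matched, so a name with extra words (such as the "NARRATED" one) could tie with an exact match instead of ranking below it.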