Elasticsearch post-processing using phonetics

I'm evaluating Elasticsearch for a future production deployment.
My problem is that I need to combine fuzzy search and phonetic matching to reach my goal, as described below:

  • Query using fuzzy matching
GET _search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "type": "most_fields", 
            "query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
            "fuzzy_transpositions": "true", 
            "fuzziness": "AUTO", 
            "fields": ["artist_name", "title_track"],
            "slop": 100,
            "max_expansions": 30
          }
        },
        {
          "multi_match": {
            "type": "cross_fields", 
            "query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
            "fields": ["artist_name", "title_track"],
            "boost": 5, 
            "operator": "and",
            "max_expansions": 30
          }
        }
      ]
    }
  }
}
  • The results are pretty good, even with a string as mangled as the one in the query above:
{
  "took": 316,
  "timed_out": false,
  "_shards": {
    "total": 11,
    "successful": 11,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1169343,
    "max_score": 26.201363,
    "hits": [
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "zVzFm2gB0djhmNXkB5y-",
        "_score": 26.201363,
        "_source": {
          "title_track": "HEY JUDE",
          "album_id": null,
          "artist_id": 38387,
          "artist_name": """"BEATLES, THE""""
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "X1ETmmgB0djhmNXkARTQ",
        "_score": 26.201363,
        "_source": {
          "title_track": "HEY JUDE",
          "album_id": null,
          "artist_id": 21183,
          "artist_name": "THE  BEATLES"
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "MF34m2gB0djhmNXkTvIn",
        "_score": 26.080318,
        "_source": {
          "title_track": "HEY JUDE",
          "album_id": 6135978,
          "artist_id": 40333,
          "artist_name": "BEATLES, THE"
        }
      },
...

  • The problem begins when the artist and/or track is not indexed:
GET _search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "type": "most_fields", 
            "query": "justin bieber - sorry",
            "fuzzy_transpositions": "true", 
            "fuzziness": "AUTO", 
            "fields": ["artist_name", "title_track"],
            "slop": 100,
            "max_expansions": 30
          }
        },
        {
          "multi_match": {
            "type": "cross_fields", 
            "query": "justin bieber - sorry",
            "fields": ["artist_name", "title_track"],
            "boost": 5, 
            "operator": "and",
            "max_expansions": 30
          }
        }
      ]
    }
  }
}
  • The results do not contain Justin Bieber, since he is not indexed; loosely related documents are returned with high scores instead:
{
  "took": 121,
  "timed_out": false,
  "_shards": {
    "total": 11,
    "successful": 11,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 19730,
    "max_score": 24.51635,
    "hits": [
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "-XfOn2gB0djhmNXkENiE",
        "_score": 24.51635,
        "_source": {
          "title_track": "JUSTIN",
          "album_id": 5897467,
          "artist_id": 117964,
          "artist_name": "JUSTIN"
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "yXfOn2gB0djhmNXkCdjW",
        "_score": 24.42126,
        "_source": {
          "title_track": "JUSTIN",
          "album_id": null,
          "artist_id": 117964,
          "artist_name": "JUSTIN"
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "iDxal2gB0djhmNXkY_ew",
        "_score": 23.26923,
        "_source": {
          "title_track": "JUSTIN BIEBER",
          "album_id": null,
          "artist_id": 10851,
          "artist_name": "SMASH MOUTH"
        }
      },
...

The goal is to know whether a given artist and track are indexed. I need the results to be as accurate as possible while still using fuzziness to cover misspellings.

My idea is to use the phonetic plugin with metaphone to post-process both the retrieved documents and the input string, and then check whether the metaphone codes generated for each document are present among the metaphone codes of the input string.
I was hoping I could send a single query and have Elasticsearch return all of this in the same result set, or at least tell me whether a match was found.
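
As a rough illustration of that containment check, here is a minimal Python sketch. It uses a small hand-written Soundex encoder as a stand-in for metaphone, purely so the example runs with the standard library; the codes produced by the plugin's double_metaphone encoder would differ. The helper names `soundex`, `phonetic_keys` and `covered_by` are hypothetical.

```python
import re

# Soundex digit for each consonant group; vowels, H, W and Y carry no digit.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word):
    """Return the 4-character American Soundex key of `word` ('' if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (key + "000")[:4]


def phonetic_keys(text):
    """Set of Soundex keys for every alphabetic token in `text`."""
    return {soundex(tok) for tok in re.findall(r"[A-Za-z]+", text)}


def covered_by(field_value, query_text):
    """True if every token of `field_value` has a sound-alike token in `query_text`."""
    keys = phonetic_keys(field_value)
    return bool(keys) and keys <= phonetic_keys(query_text)


query = "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014"
print(covered_by("HEY JUDE", query))                          # title sounds like "hey jode"
print(covered_by("SMASH MOUTH", "justin bieber - sorry"))     # artist is not in the query
```

The check has to be run per field: the title "JUSTIN BIEBER" from the third hit above would pass on its own, and only the artist "SMASH MOUTH" reveals the mismatch.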

So far I have only managed to get phonetic output by calling:

GET phonetic/_analyze
{
  "analyzer": "dbl_metaphone",
  "text": "The Beatles – Hello Goodbye"
} 

or

GET /phonetic/phonetic/_search
{
    "query": {
        "match": {
            "user.phonetic": {
                "query":"beatles"
            }
        }
    }
}

This is far from what I need, since I could not find a way to use phonetics and fuzzy search on the same field.
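
One way this might be approached is to keep fuzzy matching on the plain fields and add a separate clause against phonetic sub-fields, rather than forcing both onto one field. The Python sketch below only builds such a request body. The `.phonetic` sub-fields on `artist_name` and `title_track` are assumptions (the mapping in this post only defines `user.phonetic`), `build_query` is a hypothetical helper, and how the two clauses score against each other on this data set is untested.

```python
import json


def build_query(text, fields=("artist_name", "title_track")):
    """Build a bool query: fuzzy match on the plain fields, phonetic match on sub-fields."""
    # Assumed sub-fields, each analyzed with a phonetic analyzer such as dbl_metaphone.
    phonetic_fields = [field + ".phonetic" for field in fields]
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        # Tolerates misspellings on the normally analyzed text.
                        "multi_match": {
                            "query": text,
                            "fields": list(fields),
                            "type": "most_fields",
                            "fuzziness": "AUTO",
                        }
                    },
                    {
                        # No fuzziness here: the phonetic codes already absorb
                        # sound-alike spellings.
                        "multi_match": {
                            "query": text,
                            "fields": phonetic_fields,
                            "type": "most_fields",
                        }
                    },
                ]
            }
        }
    }


# The printed JSON can be pasted after a `GET <index>/_search` line in the console.
print(json.dumps(build_query("justin bieber - sorry"), indent=2))
```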

Here's how the phonetic analyzer and filter were created:

PUT /phonetic
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "type":    "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "standard",
          "filter":    "dbl_metaphone"
        }
      }
    }
  }
}

PUT /phonetic/_mapping/phonetic
{
  "properties": {
    "user": {
      "type": "text",
      "fields": {
        "phonetic": {
          "type":     "text",
          "analyzer": "dbl_metaphone"
        }
      }
    }
  }
}

I could not find more detailed material about the phonetic plugin for Elasticsearch, or about how to use it from scripts. The idea in that case would be to post-process each document, generate phonetic codes for each of its tokens, and compare them against each word of the search string.

I could write an external program to receive and process Elasticsearch's results, but this would be too clunky since now I would have two APIs, one calling the other (I still need to serve the results via API).
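
For reference, such an external post-processing step could be fairly small. The sketch below is only an assumption about what it might look like: it takes the `hits` array from a response like the ones above and keeps the first hit whose artist and title are both accounted for by the input string. It uses the similarity ratio from Python's `difflib` instead of metaphone, and the names `tokens`, `consume`, `first_full_match` and the 0.7 cutoff are hypothetical choices.

```python
import difflib
import re


def tokens(text):
    """Lowercased alphanumeric tokens of `text`."""
    return re.findall(r"[a-z0-9]+", text.lower())


def consume(value, remaining, cutoff):
    """Remove one close match from `remaining` for each token of `value`.

    Returns False as soon as a token has no close match, so a query word
    cannot be counted for both the artist and the title.
    """
    toks = tokens(value)
    if not toks:
        return False
    for tok in toks:
        match = difflib.get_close_matches(tok, remaining, n=1, cutoff=cutoff)
        if not match:
            return False
        remaining.remove(match[0])
    return True


def first_full_match(query_text, hits, cutoff=0.7):
    """Return the first hit whose artist and title are both covered by the query, else None."""
    query_tokens = tokens(query_text)
    for hit in hits:
        source = hit["_source"]
        remaining = list(query_tokens)  # fresh copy for every hit
        if (consume(source["artist_name"], remaining, cutoff)
                and consume(source["title_track"], remaining, cutoff)):
            return hit
    return None


hits = [
    {"_source": {"title_track": "JUSTIN", "artist_name": "JUSTIN"}},
    {"_source": {"title_track": "JUSTIN BIEBER", "artist_name": "SMASH MOUTH"}},
]
print(first_full_match("justin bieber - sorry", hits))  # no hit covers artist and title
```

Consuming query tokens as they are matched is what rejects the "JUSTIN" / "JUSTIN" document: the single word "justin" in the input can account for the artist or the title, but not both.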

To summarize, I need to make sure that an artist and track are indexed, but at the same time I need to accept misspellings.

Many thanks in advance.
