I'm evaluating Elasticsearch for a future production deployment.
I need to combine fuzzy search and phonetic matching to achieve my objective. Here is what I have so far:
- The query, using fuzzy matching:
GET _search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"type": "most_fields",
"query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
"fuzzy_transpositions": "true",
"fuzziness": "AUTO",
"fields": ["artist_name", "title_track"],
"slop": 100,
"max_expansions": 30
}
},
{
"multi_match": {
"type": "cross_fields",
"query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
"fields": ["artist_name", "title_track"],
"boost": 5,
"operator": "and",
"max_expansions": 30
}
}]
}
}
}
- The results are pretty good, even with the search string as mangled as it is in the query above:
{
"took": 316,
"timed_out": false,
"_shards": {
"total": 11,
"successful": 11,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1169343,
"max_score": 26.201363,
"hits": [
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "zVzFm2gB0djhmNXkB5y-",
"_score": 26.201363,
"_source": {
"title_track": "HEY JUDE",
"album_id": null,
"artist_id": 38387,
"artist_name": """"BEATLES, THE""""
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "X1ETmmgB0djhmNXkARTQ",
"_score": 26.201363,
"_source": {
"title_track": "HEY JUDE",
"album_id": null,
"artist_id": 21183,
"artist_name": "THE BEATLES"
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "MF34m2gB0djhmNXkTvIn",
"_score": 26.080318,
"_source": {
"title_track": "HEY JUDE",
"album_id": 6135978,
"artist_id": 40333,
"artist_name": "BEATLES, THE"
}
},
...
- The problem begins when the artist and/or track is not indexed:
GET _search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"type": "most_fields",
"query": "justin bieber - sorry",
"fuzzy_transpositions": "true",
"fuzziness": "AUTO",
"fields": ["artist_name", "title_track"],
"slop": 100,
"max_expansions": 30
}
},
{
"multi_match": {
"type": "cross_fields",
"query": "justin bieber - sorry",
"fields": ["artist_name", "title_track"],
"boost": 5,
"operator": "and",
"max_expansions": 30
}
}]
}
}
}
- The results cannot contain Justin Bieber's track, since it is not indexed, but the query still returns 19730 loosely related hits:
{
"took": 121,
"timed_out": false,
"_shards": {
"total": 11,
"successful": 11,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 19730,
"max_score": 24.51635,
"hits": [
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "-XfOn2gB0djhmNXkENiE",
"_score": 24.51635,
"_source": {
"title_track": "JUSTIN",
"album_id": 5897467,
"artist_id": 117964,
"artist_name": "JUSTIN"
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "yXfOn2gB0djhmNXkCdjW",
"_score": 24.42126,
"_source": {
"title_track": "JUSTIN",
"album_id": null,
"artist_id": 117964,
"artist_name": "JUSTIN"
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "iDxal2gB0djhmNXkY_ew",
"_score": 23.26923,
"_source": {
"title_track": "JUSTIN BIEBER",
"album_id": null,
"artist_id": 10851,
"artist_name": "SMASH MOUTH"
}
},
...
The goal is to know whether a given artist and track are indexed. I need the results to be as accurate as possible while still using fuzziness to cover misspellings.
My idea is to use the phonetic analysis plugin with a metaphone encoder to post-process both the retrieved documents and the input string, and then check whether the metaphone codes generated for each document are present among the codes generated for the input string.
I was hoping to send a single query and have Elasticsearch return all of this in the same result set, or at least tell me whether a match was found.
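To make the yes/no part concrete, this is roughly the behaviour I am after, written as an untested sketch. It assumes a hypothetical artist_track field that both artist_name and title_track are copied into with copy_to; that field does not exist in my index today. With the "and" operator, every token must match (fuzzily), so zero hits would mean "not indexed".

```console
# Untested sketch. artist_track is a hypothetical combined field
# (artist_name + title_track via copy_to); it is not in my current mapping.
GET repmatch/_search
{
  "size": 1,
  "query": {
    "match": {
      "artist_track": {
        "query": "justin bieber - sorry",
        "operator": "and",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

This would reject the JUSTIN and SMASH MOUTH hits shown above, but every token would have to match, including noise such as FLAC or CDQ, so it would most likely reject the noisy Beatles query as well.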
So far, the only ways I have managed to use the phonetic analysis are calling the _analyze API:
GET phonetic/_analyze
{
"analyzer": "dbl_metaphone",
"text": "The Beatles – Hello Goodbye"
}
or
GET /phonetic/phonetic/_search
{
"query": {
"match": {
"user.phonetic": {
"query":"beatles"
}
}
}
}
This is far from what I need, since I could not find a way to use phonetic matching and fuzzy search on the same field :\
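What I would like to end up with is something like the following sketch, where fuzzy matching runs against the plain fields and phonetic matching runs against sub-fields of the same source fields. The .phonetic sub-fields are hypothetical: they assume artist_name and title_track have been re-mapped with a dbl_metaphone multi-field, which is not the case in my index yet. I have not run this.

```console
# Untested sketch. artist_name.phonetic and title_track.phonetic are
# hypothetical sub-fields analyzed with the dbl_metaphone analyzer.
GET repmatch/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "type": "most_fields",
            "query": "The Beatle$ - hey jode",
            "fuzziness": "AUTO",
            "fields": ["artist_name", "title_track"]
          }
        },
        {
          "multi_match": {
            "type": "most_fields",
            "query": "The Beatle$ - hey jode",
            "fields": ["artist_name.phonetic", "title_track.phonetic"]
          }
        }
      ]
    }
  }
}
```

Even if this works, it only adds the two scores together; it still does not tell me whether the artist and track are really present.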
Here's how the phonetic analyzer and filter were created:
PUT /phonetic
{
"settings": {
"analysis": {
"filter": {
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": "dbl_metaphone"
}
}
}
}
}
PUT /phonetic/_mapping/phonetic
{
"properties": {
"user": {
"type": "text",
"fields": {
"phonetic": {
"type": "text",
"analyzer": "dbl_metaphone"
}
}
}
}
}
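Applying the same multi-field pattern to the real fields might look like the sketch below. The index name repmatch_phonetic is made up for illustration; the repertoire type and the field names come from the results above. The text type for both fields and the 6.x-style typed mapping are assumptions on my part, and the phonetic analysis plugin must be installed.

```console
# Untested sketch. repmatch_phonetic is a hypothetical new index;
# the field types are assumed, not copied from my real mapping.
PUT /repmatch_phonetic
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": { "type": "phonetic", "encoder": "double_metaphone" }
      },
      "analyzer": {
        "dbl_metaphone": { "tokenizer": "standard", "filter": ["dbl_metaphone"] }
      }
    }
  },
  "mappings": {
    "repertoire": {
      "properties": {
        "artist_name": {
          "type": "text",
          "fields": { "phonetic": { "type": "text", "analyzer": "dbl_metaphone" } }
        },
        "title_track": {
          "type": "text",
          "fields": { "phonetic": { "type": "text", "analyzer": "dbl_metaphone" } }
        }
      }
    }
  }
}
```

With this layout the plain fields would stay available for fuzzy queries, while the .phonetic sub-fields would hold the double metaphone codes.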
I have not found any more detailed material about the phonetic plugin for Elasticsearch, or about how to use it from scripts, for example. The idea in that case would be to post-process each document, generate the phonetic codes for each of its tokens, and compare them against the codes for each word in the search string.
I could write an external program to receive and process Elasticsearch's results, but that would be clunky, since I would then have two APIs, one calling the other (I still need to serve the results via an API).
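One feature that might avoid the second API is named queries: a clause given a _name is reported back per hit in a matched_queries array. A sketch of how I imagine using it, assuming hypothetical .phonetic sub-fields on artist_name and title_track that are analyzed with dbl_metaphone (they do not exist in my index yet):

```console
# Untested sketch. The .phonetic sub-fields are hypothetical;
# the _name values are arbitrary labels chosen for illustration.
GET repmatch/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "artist_name.phonetic": {
              "query": "justin bieber - sorry",
              "_name": "artist_phonetic"
            }
          }
        },
        {
          "match": {
            "title_track.phonetic": {
              "query": "justin bieber - sorry",
              "_name": "title_phonetic"
            }
          }
        }
      ]
    }
  }
}
```

Each hit would then list artist_phonetic and/or title_phonetic under matched_queries. However, a clause counts as matched as soon as a single token matches, so on its own this would not separate JUSTIN from JUSTIN BIEBER.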
To summarize, I need to determine whether an artist and track are indexed, while still tolerating misspellings in the input.
Many thanks in advance.