I have a file of alternate spellings for the terms in my index, and I want to produce bigrams that contain those alternate spellings. For example, my alternate-spellings CSV file contains biriyani, biryani, briyani, and my field contains the text Chicken Biryani. I want to be able to produce the tokens "chicken biryani", "chicken biriyani" and "chicken briyani".
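To make the desired output concrete, here is a minimal sketch in plain Python of the expansion I am after. It is not an Elasticsearch feature; the function names (build_lookup, variants, position_aware_bigrams) are made up for illustration, and the synonym group is hard-coded instead of being read from the CSV file.

```python
from itertools import product


def build_lookup(synonym_groups):
    """Map each spelling to the full list of spellings in its group."""
    lookup = {}
    for group in synonym_groups:
        for term in group:
            lookup[term] = group
    return lookup


def variants(token, lookup):
    """Return the token itself followed by its alternate spellings, if any."""
    group = lookup.get(token, [token])
    return [token] + [t for t in group if t != token]


def position_aware_bigrams(tokens, synonym_groups):
    """Build bigrams where each position may hold any spelling of its token."""
    lookup = build_lookup(synonym_groups)
    bigrams = []
    for left, right in zip(tokens, tokens[1:]):
        for a, b in product(variants(left, lookup), variants(right, lookup)):
            bigrams.append(f"{a} {b}")
    return bigrams


# Hypothetical stand-in for the contents of the alternate-spellings file.
groups = [["biriyani", "biryani", "briyani"]]
print(position_aware_bigrams(["chicken", "biryani"], groups))
```

The point is that the alternate spellings replace the token at its own position and are never paired with each other.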
If I use a whitespace tokenizer with a synonym filter, the tokens generated are chicken, biriyani, biryani, briyani, which is expected. If I then apply a shingle filter, the tokens generated are chicken, chicken biryani, biryani, biryani biriyani, biriyani, biriyani briyani, briyani. This token stream contains shingles built from the synonyms of the word itself, which should not be there, and it does not contain chicken followed by an alternate spelling of biryani, such as chicken biriyani or chicken briyani. If I place the shingle filter before the synonym filter, synonyms are only added for the unigram: chicken, chicken biryani, biriyani, biryani, briyani.

Is there a way to generate shingles in which the synonyms occupy the same position as the original token, in this case chicken biryani, chicken biriyani, chicken briyani?
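The synonym-then-shingle output reported above looks like what you would get if the shingle step treated the synonyms as consecutive tokens instead of as alternatives at one position. The sketch below is only a model of that observed output, not Lucene's actual implementation; the function name flat_shingles and the order of the tokens in the stream are assumptions on my part.

```python
def flat_shingles(stream):
    """Emit each token plus a bigram with its successor, ignoring positions."""
    out = []
    for i, token in enumerate(stream):
        out.append(token)
        if i + 1 < len(stream):
            # Pairs every token with whatever comes next in the stream,
            # even when the next token is a synonym at the same position.
            out.append(f"{token} {stream[i + 1]}")
    return out


# Assumed order in which the synonym filter hands tokens to the shingle filter.
stream = ["chicken", "biryani", "biriyani", "briyani"]
print(flat_shingles(stream))
```

Under that assumption the result has the same tokens as the stream I am seeing, including the unwanted biryani biriyani and biriyani briyani.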
I am running Elasticsearch 5.6. Sample settings for testing:
PUT test_bigram
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "biriyani, biryani, briyani"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "filter": [
              "synonym"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          },
          "shingle_synonym": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "shingle",
              "synonym"
            ]
          },
          "synonym_shingle": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "synonym",
              "shingle"
            ]
          }
        }
      }
    }
  }
}
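
To compare the three analyzers, the token streams can be inspected through the index's _analyze endpoint. Below is a small sketch using only the Python standard library. The cluster address in ES_URL is an assumption (a local node on port 9200), and analyze_request and tokens are helper names I made up for this example.

```python
import json
from urllib import request

ES_URL = "http://localhost:9200"  # assumption: a local test cluster


def analyze_request(index, analyzer, text):
    """Build (but do not send) a POST request for the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return request.Request(
        f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens(index, analyzer, text):
    """Send the request and return (token, position) pairs.

    Needs a running cluster that already has the index created.
    """
    with request.urlopen(analyze_request(index, analyzer, text)) as resp:
        return [(t["token"], t["position"]) for t in json.load(resp)["tokens"]]


# Example call, once the test_bigram index exists:
#   tokens("test_bigram", "synonym_shingle", "chicken biryani")
```

Returning the position alongside each token makes it easier to see whether the synonyms share a position with the original token.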