I want to create an index that uses a stemmer to generalize my synonyms, and then reuse that synonym expansion in other analyzers.
As a simplified example, I want all of these synonyms [beautiful, pretty, beauteous, gorgeous] to be used in multiple analyzers when searching for beauty, since beauty and beautiful share the same stem:
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stemmer" ],
"text": "beautiful beauty"
}
{
"tokens": [
{
"token": "beauti", ...
},
{
"token": "beauti", ...
}
]
}
What I have so far is:
PUT /test_synonyms
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": ["beautiful, pretty, beauteous, gorgeous"]
},
"my_metaphone": {
"type": "phonetic",
"encoder": "metaphone",
"replace": true
}
}
}
}
}
}
With the stemmer placed before the synonym filter, I get:
GET /test_synonyms/_analyze
{
"tokenizer": "standard",
"filter": ["stemmer", "my_synonyms"],
"text": "beauty"
}
{
"tokens": [
{
"token": "beauti", ...
},
{
"token": "pretti", ...
},
{
"token": "beauteou", ...
},
{
"token": "gorgeou", ...
}
]
}
The synonym filter followed by the phonetic filter gives me:
GET /test_synonyms/_analyze
{
"tokenizer": "standard",
"filter": ["my_synonyms", "my_metaphone"],
"text": "beauty"
}
{
"tokens": [
{
"token": "BT", ...
}
]
}
But "BT" doesn't match any of the tokens produced for "beautiful":
GET /test_synonyms/_analyze
{
"tokenizer": "standard",
"filter": ["my_synonyms", "my_metaphone"],
"text": "beautiful"
}
{
"tokens": [
{
"token": "BTFL", ... /*beautiful*/
},
{
"token": "PRT", ... /*pretty*/
},
{
"token": "BTS", ... /*beauteous*/
},
{
"token": "KRJS", ... /*gorgeous*/
}
]
}
Is there a way to return the exact synonym words (not their stems) while still using the stemmer to find them, and then feed those words into other analyzers? In other words, I want something that gives me the response above when searching for beauty.
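To make the desired behaviour concrete, here is a toy Python sketch. It is not Elasticsearch code. It assumes a hypothetical hand-written stem table (`STEMS`, seeded with stems from the outputs in this post) in place of the real stemmer filter, and the helper names `stem` and `expand` are made up for illustration. The stem is used only to find the synonym group. The tokens returned are the original words, which could then be passed on to another filter such as the phonetic one.

```python
# Toy illustration only; not Elasticsearch code.
# STEMS is a hand-written stand-in for the stemmer filter (assumption),
# seeded with stems taken from the outputs in this post.
STEMS = {
    "beauty": "beauti",
    "beautiful": "beauti",
    "pretty": "pretti",
    "beauteous": "beauteou",
    "gorgeous": "gorgeou",
}

SYNONYM_GROUPS = [["beautiful", "pretty", "beauteous", "gorgeous"]]


def stem(word):
    """Return the stem from the lookup table, or the word itself if unknown."""
    return STEMS.get(word, word)


# Index each group by the stems of its members, but keep the original words.
GROUP_BY_STEM = {}
for group in SYNONYM_GROUPS:
    for member in group:
        GROUP_BY_STEM[stem(member)] = group


def expand(word):
    """Find the synonym group via the stem; return the unstemmed synonyms."""
    return list(GROUP_BY_STEM.get(stem(word), [word]))


print(expand("beauty"))
print(expand("beautiful"))
```

Both calls print the same list of four unstemmed synonyms, which is the behaviour I am trying to get out of the analysis chain.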
I tried to use the stemmer and phonetic filters together, but it gives me:
GET /test_synonyms/_analyze
{
"tokenizer": "standard",
"filter": ["stemmer", "my_synonyms", "my_metaphone"],
"text": "beauty" /*or beautiful (equal responses)*/
}
{
"tokens": [
{
"token": "BT", ... /*beauti*/
},
{
"token": "PRT", ... /*pretti*/
},
{
"token": "BT", ... /*beauteou*/
},
{
"token": "KRJ", ... /*gorgeou*/
}
]
}
This isn't what I really want, because when I search for "beautiful" and for "beauty", the number of documents returned is different (the "beautiful" search also scores the phonetic matches), and I want both searches to return the same documents.