I am using the pattern_capture filter to preserve all the acronyms:
PUT test_index/_settings
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+",
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
      ],
      "preserve_original": true
    }
  }
}
But I noticed that acronyms ending with s or s. are stemmed, because a stemmer filter is also attached to the analyzer. The regular expressions in the filter above that are meant to handle the trailing s are not working either.
I test the output with this request:
GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t.
This gives me:
{
  "tokens": [
    {
      "token": "u.s.a",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "u.",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "u.",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "s.w.a.t",
      "start_offset": 12,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "u.t",
      "start_offset": 20,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
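As a rough cross-check outside Elasticsearch, the minimal Python sketch below runs the same four patterns over single tokens. It rests on a few assumptions: Python's re module behaves like Java's regex engine for these simple patterns, the filter emits capture group 1 when a pattern has a group and the whole match otherwise, and the helper name captures is purely illustrative. Judging by the offsets 7 to 10 in the output above, the token that reaches the filter appears to be u.s without the trailing dot, so that case is included.

```python
import re

# The four patterns from the acronym_en_EN filter, with the JSON escaping removed.
PATTERNS = [
    r"(?:[a-zA-Z]\.)+",
    r"((?:[a-zA-Z]\.)+[a-zA-Z])",
    r"((?:[a-zA-Z]\.)+[s]$)",
    r"((?:[a-zA-Z]\.)+[s][\.]$)",
]


def captures(token):
    """Return the distinct substrings the patterns capture from one token."""
    found = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            # Assumption: group 1 is used if the pattern defines a group,
            # otherwise the whole match is used.
            text = match.group(1) if match.re.groups else match.group(0)
            if text not in found:
                found.append(text)
    return found


if __name__ == "__main__":
    for token in ["u.s", "u.s.", "u.s.a", "s.w.a.t"]:
        print(token, "->", captures(token))
```

In this sketch the token u.s yields u. from the first pattern and u.s from the second and third, while the fourth pattern only matches when the trailing dot is still present. If the stemmer then strips the final s from u.s, that would be consistent with the two u. tokens at position 2, but that last step is my guess and is not shown by the sketch.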
Is there any way I can preserve the acronyms ending with s, so that I don't get u. for u.s. or u.s?