Avoid stemming of Acronyms?

apanimesh061 · September 1, 2015, 7:08pm

I am using the pattern_capture filter to preserve all acronyms:

PUT test_index/_settings
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+", 
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
        ],
      "preserve_original": true
    }
  }
}

But I noticed that acronyms ending in s or s. are still stemmed, because a stemmer filter is also attached to the analyzer. The regular expressions in the filter above that are meant to handle the trailing s do not seem to help either.

I test the output with this request:

GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t.

This gives me:

{
   "tokens": [
      {
         "token": "u.s.a",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "u.",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "u.",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "s.w.a.t",
         "start_offset": 12,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "u.t",
         "start_offset": 20,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}

Is there any way I can preserve the acronyms ending with s so that for u.s. or u.s I don't get u.?

n0othing · September 2, 2015, 1:32am

It looks like the Porter stemmer is doing that (though I'm unsure why at the moment). Without it, the acronyms come out as you expect. I was able to get your current setup to work by adding the keyword marker token filter:

PUT test_index
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+", 
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
        ],
      "preserve_original": true
    },
    "porter_stemmer_en_EN" : {
      "type" : "stemmer",
      "name" : "english"
    },
    "no_stem": {
      "type": "keyword_marker",
      "keywords": [ "u.s" ]
    }
  }
}

Then this request:

GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,no_stem,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t.

results in:

{
   "tokens": [
      {
         "token": "u.s.a",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "u.s",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "s.w.a.t",
         "start_offset": 12,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "u.t",
         "start_offset": 20,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
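
A possible explanation for the stemming, offered as a sketch rather than a confirmed diagnosis: the first step of the classic Porter algorithm (step 1a) drops a trailing "s" unless the token ends in "ss", so "u.s" would become "u.". The Python below is not Elasticsearch code. It only imitates that one step, plus the effect of a keyword marker that exempts listed tokens from stemming. The function names `porter_step_1a` and `stem_unless_keyword` are made up for this illustration, and the input tokens assume the standard tokenizer has already dropped the final period.

```python
def porter_step_1a(token):
    """Classic Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (removed)."""
    if token.endswith("sses"):
        return token[:-2]
    if token.endswith("ies"):
        return token[:-2]
    if token.endswith("ss"):
        return token
    if token.endswith("s"):
        return token[:-1]
    return token


def stem_unless_keyword(token, keywords):
    """Leave protected tokens alone, roughly what a keyword marker achieves."""
    if token in keywords:
        return token
    return porter_step_1a(token)


# Tokens as they appear after the standard tokenizer and lowercase filter.
tokens = ["u.s.a", "u.s", "s.w.a.t", "u.t"]

without_marker = [stem_unless_keyword(t, set()) for t in tokens]
with_marker = [stem_unless_keyword(t, {"u.s"}) for t in tokens]

print(without_marker)  # ['u.s.a', 'u.', 's.w.a.t', 'u.t']
print(with_marker)     # ['u.s.a', 'u.s', 's.w.a.t', 'u.t']
```

If this is what is happening, any other acronym ending in "s" would need its own entry in the keywords list, since only exact matches are protected.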
