Avoid stemming of Acronyms?


(apanimesh061) #1

I am using the pattern_capture filter to preserve acronyms:

PUT test_index/_settings
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+", 
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
      ],
      "preserve_original": true
    }
  }
}

But I noticed that acronyms ending in s or s. are still stemmed, because a stemmer filter is also attached to the analyzer. The regular expressions in the filter above that are meant to handle the trailing s do not help either.

I test the output with this request:

GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t. 

This gives me:

{
   "tokens": [
      {
         "token": "u.s.a",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "u.",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "u.",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "s.w.a.t",
         "start_offset": 12,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "u.t",
         "start_offset": 20,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}

Is there any way to preserve acronyms ending in s, so that u.s. or u.s does not come out as u.?


(Robbie Ogburn) #2

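It looks like the Porter stemmer is doing that (though I'm not sure why at the moment). Without it, the acronyms come out as you expect. I was able to get your current setup to work by adding the keyword marker token filter ahead of the stemmer: it flags the listed tokens as keywords, and the stemmer skips tokens flagged that way.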

PUT test_index
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+", 
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
      ],
      "preserve_original": true
    },
    "porter_stemmer_en_EN" : {
      "type" : "stemmer",
      "name" : "english"
    },
    "no_stem": {
      "type": "keyword_marker",
      "keywords": [ "u.s" ]
    }
  }
}

Then:

GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,no_stem,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t.

results in:

{
   "tokens": [
      {
         "token": "u.s.a",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "u.s",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "s.w.a.t",
         "start_offset": 12,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "u.t",
         "start_offset": 20,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
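
Listing every acronym under keywords will not scale past a handful of cases. One possible variation, sketched here and untested against the version used in this thread, is to give the keyword marker filter a regular expression instead of a word list. This assumes your Elasticsearch release supports the keywords_pattern option on keyword_marker (newer releases document it; older ones may not). The pattern itself is only an illustration: it is meant to match whole lowercase tokens made of single letters separated by periods, such as u.s, u.s.a or s.w.a.t, and it relies on the lowercase filter running first.

```json
{
  "no_stem": {
    "type": "keyword_marker",
    "keywords_pattern": "^(?:[a-z]\\.)+[a-z]?$"
  }
}
```

This would replace the no_stem entry inside "index.analysis.filter" above; the filter chain in the _analyze request stays the same, with no_stem still placed before porter_stemmer_en_EN.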
