Custom analyzer: keyword_marker


#1

I'm using a custom analyzer that protects some keywords from being stemmed.

The protection itself works as expected: the keyword is not stemmed. But for some reason a second, stemmed-looking version of the keyword is produced as well.

Here's the behaviour when using the _analyze endpoint:

GET /myIndex/_analyze?analyzer=germanComp
{
"AIDS"
}

{
  "tokens": [
    {
      "token": "aids",
      "start_offset": 7,
      "end_offset": 11,
      "type": "",
      "position": 1
    },
    {
      "token": "aid",
      "start_offset": 7,
      "end_offset": 10,
      "type": "",
      "position": 1
    }
  ]
}
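To make the duplicate easier to see, here is a minimal Python sketch that groups the tokens from the response above by position. The helper name `group_by_position` is made up for illustration, and the JSON is the response pasted from this post, not fetched from a live cluster.

```python
import json
from collections import defaultdict

# The _analyze response from this post, pasted as a string (not a live request).
response = json.loads("""
{
  "tokens": [
    {"token": "aids", "start_offset": 7, "end_offset": 11, "type": "", "position": 1},
    {"token": "aid",  "start_offset": 7, "end_offset": 10, "type": "", "position": 1}
  ]
}
""")


def group_by_position(tokens):
    """Hypothetical helper: map each position to its (token, start, end) tuples."""
    grouped = defaultdict(list)
    for t in tokens:
        grouped[t["position"]].append(
            (t["token"], t["start_offset"], t["end_offset"])
        )
    return dict(grouped)


stacked = group_by_position(response["tokens"])
for position, entries in sorted(stacked.items()):
    # More than one entry at a position means the tokens are stacked there.
    print(position, entries)
```

Both tokens sit at position 1 and share a start offset of 7, but their end offsets differ (11 versus 10).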

The first token is perfect, but I'd like to get rid of the second one.

Any idea how I can achieve it?

Here's my custom analyzer:

"analysis": {
  "filter": {
    "german_stop": {
      "type": "stop",
      "stopwords": "german"
    },
    "itm_keywords": {
      "type": "keyword_marker",
      "keywords": ["aids"]
    },
    "german_stemmer": {
      "type": "stemmer",
      "language": "light_german"
    },
    "unify": {
      "type": "unique",
      "only_on_same_position": true
    }
  },
  "analyzer": {
    "germanComp": {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "itm_keywords",
        "german_stop",
        "german_normalization",
        "german_stemmer",
        "unify"
      ]
    }
  }
}

