Pattern_replace char filter regex

These are my index settings:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "^([a-zA-Z0-9_]+)-([a-zA-Z0-9_]+)",
          "replacement": "$1$2"
        }
      }
    }
  }
}

I'm trying to analyze these two texts:
POST my_index/_analyze
{"analyzer":"my_analyzer","text":"elastic-search"}
POST my_index/_analyze
{"analyzer":"my_analyzer","text":"-search"}

Case 1 works fine, but for case 2 I get the token 'search', which is not what I want. I want the filter to skip the text if nothing precedes the hyphen. What am I doing wrong?

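I think the issue is the use of the standard tokenizer. Char filters run before the tokenizer, and your pattern requires at least one character before the hyphen, so it does not match "-search" and the text reaches the tokenizer unchanged. The standard tokenizer then drops the leading hyphen, which is why you get the token 'search'.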
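
One way to check which stage loses the hyphen is to run the same chain inline through the _analyze API with explain turned on. This is only a sketch: it assumes the explain option is available in your version, and it repeats the char filter and tokenizer from the question inline instead of going through my_index.

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "^([a-zA-Z0-9_]+)-([a-zA-Z0-9_]+)",
      "replacement": "$1$2"
    }
  ],
  "text": "-search",
  "explain": true
}

The response should report the text after the char filter and the tokens produced by the tokenizer separately. If the char filter output still contains the hyphen and the token list does not, the standard tokenizer is where it is lost.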

Instead, you could use something like the whitespace tokenizer:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern":"(\\w+)-(\\w+)",
          "replacement": "$1$2"
        }
      }
    }
  }
}
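
To check the result, you can run the two texts from the question against this index. I would expect the first to come back as the single token elasticsearch and the second as -search with the hyphen intact, since the whitespace tokenizer only splits on whitespace, but I have not pasted real output here, so treat that as an expectation and not a verified result.

POST test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "elastic-search"
}

POST test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "-search"
}

Note that the whitespace tokenizer also leaves other punctuation attached to the tokens, so whether it fits depends on the rest of your data.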