word_delimiter: remove leading/trailing hyphens but retain hyphens inside words

Hi

I would like to retain a hyphen between words but remove it at the beginning or end of a word.

For the word -ecigarette, I would like the hyphen to be removed so that a search for "ecigarette" returns a result. For e-cigarette, however, a search for "ecigarette" should not return results.

When I use "type_table": ["- => ALPHA"] in my custom analyzer, the hyphen is always retained and a search for "ecigarette" returns no results. When I do not use "type_table", the phrase is split into two tokens:
e
cigarette
and a search for "cigarette" also returns results, which is not what I want.

For words like
-e-cigarette
how can I create the token
"e-cigarette"?


My custom analyzer:
"custom_word_delimiter_filter": {
  "split_on_numerics": "false",
  "generate_word_parts": "true",
  "catenate_words": "true",
  "generate_number_parts": "true",
  "catenate_all": "false",
  "split_on_case_change": "true",
  "type": "word_delimiter",
  "catenate_numbers": "false",
  "stem_english_possessive": "false",
  "type_table": [" - => ALPHA"]
....
"my_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": [
    "my_synonyms",
    "custom_word_delimiter_filter",
    "lowercase"
  ]
}

Thank you in advance for any suggestions.

Damian

Hello @Damian,

Welcome to the community!

From my understanding, you want to create tokens that retain a hyphen inside a word but drop it at the beginning or end of the word.

To achieve this, consider adding the pattern_replace character filter, which uses Java regular expressions.

You can try something like this:

{
  "type": "pattern_replace",
  "pattern": "(^-|-$)",
  "replacement": ""
}

In this case, the pattern matches a hyphen at the beginning (^-) or end (-$) of the input text and replaces it with an empty string.
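
As a rough illustration of what the pattern does, here is a small sketch that uses Python's re module as a stand-in for the Java regex engine Elasticsearch uses. For a pattern this simple I would expect both to behave the same. The helper name strip_edge_hyphens is made up for the example.

```python
import re

# Same pattern as in the char filter above; Python's `re` stands in for Java regex here.
EDGE_HYPHEN = re.compile(r"(^-|-$)")


def strip_edge_hyphens(text: str) -> str:
    """Remove a single hyphen at the very start and/or very end of the text."""
    return EDGE_HYPHEN.sub("", text)


if __name__ == "__main__":
    for sample in ["-ecigarette", "-e-cigarette", "e-cigarette", "ecigarette-"]:
        print(sample, "->", strip_edge_hyphens(sample))
```

With these inputs, -e-cigarette becomes e-cigarette, while e-cigarette is left unchanged because its only hyphen is inside the word.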

You can try incorporating a similar filter into your existing config.

Example:

PUT test-analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(^-|-$)",
          "replacement": ""
        }
      }
    }
  }
}

GET test-analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": "-e-cigarette"
}

Token: e-cigarette
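
One caveat with the example above: a character filter sees the whole field value before the whitespace tokenizer runs, so ^ and $ anchor to the start and end of the entire text, not of each word. The sketch below (Python's re standing in for Java regex, with hypothetical helper names) compares the original pattern with a whitespace-aware variant, (?<!\S)-+|-+(?!\S). That variant is my own assumption and I have not run it against a cluster, so please verify it with _analyze before relying on it.

```python
import re

# Pattern from the example above: anchors to the whole input text.
WHOLE_TEXT = re.compile(r"(^-|-$)")
# Assumed variant: hyphens that touch whitespace or the text edges on one side.
PER_WORD = re.compile(r"(?<!\S)-+|-+(?!\S)")


def whitespace_tokens(text: str, pattern: re.Pattern) -> list[str]:
    """Apply the pattern to the whole text, then split on whitespace,
    roughly mimicking a char filter followed by the whitespace tokenizer."""
    return pattern.sub("", text).split()


if __name__ == "__main__":
    text = "buy -ecigarette or e-cigarette- today"
    print(whitespace_tokens(text, WHOLE_TEXT))
    print(whitespace_tokens(text, PER_WORD))
```

In this sketch the original pattern leaves -ecigarette and e-cigarette- untouched in the middle of the sentence, while the variant yields ecigarette and e-cigarette. If you try the variant in a pattern_replace definition, remember that the backslashes need to be doubled inside the JSON string.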
