Remove leading and trailing punctuation from terms

How can I remove only the leading and trailing punctuation from terms when indexing?

The standard tokenizer strips that punctuation, but it also splits words like "t-shirt" into "t" and "shirt", which is not desired in our use case. We want "t-shirt." to become "t-shirt".

Is there a filter available that does this?

You could use a character filter that changes a hyphen in the middle of a word into an underscore. The standard tokenizer does not split on underscores, so "t-shirt." comes out as the single token "t_shirt", with the trailing period dropped. Something like this:

GET _analyze
{
  "text": [
    "t-shirt t-shirt."
  ],
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\w+)-(?=\\w)",
      "replacement": "$1_"
    }
  ]
}

Thanks, that has fixed our problems, together with a pattern_replace as the first token filter to make sure the word_delimiter filter keeps working (for the terms "t-shirt", "t", "shirt" and "tshirt").
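
For anyone finding this later, the combined setup described above might look roughly like the index settings below. This is only a sketch under assumptions: the index name (my-index), the filter and analyzer names, and the exact word_delimiter options are made up for illustration and were not posted in this thread. It assumes the pattern_replace token filter turns the underscore back into a hyphen so that word_delimiter sees "t-shirt" again.

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "hyphen_to_underscore": {
          "type": "pattern_replace",
          "pattern": "(\\w+)-(?=\\w)",
          "replacement": "$1_"
        }
      },
      "filter": {
        "underscore_to_hyphen": {
          "type": "pattern_replace",
          "pattern": "_",
          "replacement": "-"
        },
        "hyphen_delimiter": {
          "type": "word_delimiter",
          "preserve_original": true,
          "catenate_words": true
        }
      },
      "analyzer": {
        "hyphen_aware": {
          "type": "custom",
          "char_filter": ["hyphen_to_underscore"],
          "tokenizer": "standard",
          "filter": ["underscore_to_hyphen", "hyphen_delimiter"]
        }
      }
    }
  }
}
```

With settings like these, running GET /my-index/_analyze with "analyzer": "hyphen_aware" on the text "t-shirt." should show which terms are actually produced. Note that the underscore_to_hyphen filter in this sketch also rewrites underscores that were in the original text, so it may need a narrower pattern if your data contains them.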
