Remove leading and trailing punctuation from terms

bram · March 6, 2019, 9:37am

How can I remove only the leading and trailing punctuation from terms when indexing?

The standard tokenizer strips that punctuation, but it also splits words like "t-shirt" into "t" and "shirt", which is not what we want in our use case. We only want to turn "t-shirt." into "t-shirt".

Is there a filter available that does this?

abdon · March 7, 2019, 10:30am

You could use a character filter that changes a hyphen in the middle of a word into an underscore. The standard tokenizer does not split on underscores, so "t-shirt." comes out as the single token "t_shirt" with the trailing period removed. Something like this:

GET _analyze
{
  "text": [
    "t-shirt t-shirt."
  ],
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\w+)-(?=\\w)",
      "replacement": "$1_"
    }
  ]
}
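
To check what the pattern matches before wiring it into an analyzer, here is a minimal Python sketch that applies the same regular expression with `re.sub`. It only mimics the char filter step, not the tokenizer. The helper name `hyphen_to_underscore` is made up for illustration, and Python's `\w` is Unicode-aware while Java's is ASCII-only by default, so the two should only be assumed to agree for plain ASCII input like the samples here.

```python
import re

# Same pattern as the pattern_replace char filter above: one or more word
# characters, a hyphen, and a lookahead requiring a word character after it.
PATTERN = re.compile(r"(\w+)-(?=\w)")


def hyphen_to_underscore(text: str) -> str:
    """Replace hyphens between word characters with underscores (hypothetical helper)."""
    # \1 is Python's spelling of the $1 backreference used in the filter.
    return PATTERN.sub(r"\1_", text)


if __name__ == "__main__":
    for sample in ["t-shirt t-shirt.", "a-b-c", "-shirt", "shirt-"]:
        print(repr(sample), "->", repr(hyphen_to_underscore(sample)))
```

Only hyphens with a word character on both sides are rewritten; a leading or trailing hyphen is left as it is, so the tokenizer still discards it.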

bram · March 7, 2019, 11:48am

Thanks, that fixed our problem, together with a pattern_replace as the first token filter to make sure the word_delimiter filter keeps working (for the terms "t-shirt", "t", "shirt" and "tshirt").

