Stop standard tokenizer from splitting on punctuation

Hi

We use the "standard" tokenizer in custom analyzer definitions. By default, the standard tokenizer splits words on hyphens and ampersands, so for example "i-mac" is tokenized to "i" and "mac".

Is there any way to configure the behaviour of the standard tokenizer so that it stops splitting words on all punctuation except the comma (","), while still doing all the other tokenizing it does?

Or maybe we could define a custom tokenizer to achieve this (a possible sketch follows the examples below). For example, the query string "n-12" should produce this single token:

{
  "tokens" : [
    {
      "token" : "n-12",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
} 
instead of these two tokens:
{
    "tokens" : [
      {
        "token" : "n",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "12",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "<ALPHANUM>",
        "position" : 0
      }
    ]
}
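
One approach that might work is a custom analyzer built on the char_group tokenizer, configured to split only on whitespace and commas. This is a sketch, not a tested solution; the index, analyzer, and tokenizer names below are placeholders:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keep_hyphens_analyzer": {
          "type": "custom",
          "tokenizer": "split_on_comma_and_space",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "split_on_comma_and_space": {
          "type": "char_group",
          "tokenize_on_chars": [ "whitespace", "," ]
        }
      }
    }
  }
}

Note that char_group splits only on the listed characters, so it would not reproduce the rest of the standard tokenizer's Unicode segmentation rules, and the resulting token types come back as "word" rather than "<ALPHANUM>".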

Regards,
Nipun
