Standard tokenizer punctuation symbols removed

Chris_Gatihi · March 12, 2020, 11:34pm

The ES documentation here [0] states this about the standard tokenizer:

The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.

Where can I find the definition of the exact set of punctuation symbols that get removed (e.g. hyphen, ampersand)?

Thanks,
Chris

[0] Tokenizer reference | Elasticsearch Guide [8.11] | Elastic

munasia · March 12, 2020, 11:55pm

The word boundary rules for Unicode Text Segmentation are defined in Unicode Standard Annex #29 (UAX #29). HTH

Chris_Gatihi · March 13, 2020, 3:36am

Thanks for your response @munasia. My initial thought was that the definition of the punctuation that gets removed would be there, but I couldn't find it. Did you? Looking more closely at the text I quoted above, it seems that the word boundaries are defined by the Unicode Text Segmentation algorithm but the punctuation removal is done separately from that. At least, the wording leaves it unclear whether punctuation removal is part of the Unicode Text Segmentation algorithm or something separate. My guess is that it's separate, but I'm not sure, which is why I posted this question.

munasia · March 13, 2020, 3:38pm

The UTS document defines the word boundary rules. It won't contain a list of the punctuation that Elasticsearch removes.

The Standard tokenizer implements the word boundary rules as defined in the UTS document. The actual punctuation removal is done in the Lucene code.
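Since there is no published list, one practical way to see which symbols are dropped is to run sample text through the `_analyze` API with the standard tokenizer and compare the input with the tokens that come back. Below is a minimal Python sketch of that idea. The helper names (`build_analyze_request`, `surviving_tokens`, `dropped_characters`, `analyze`) and the `http://localhost:9200` address are assumptions for illustration, not anything from the docs; adjust the URL and add authentication for your own cluster.

```python
import json
import urllib.request


def build_analyze_request(text, base_url="http://localhost:9200"):
    """Build a POST request for the _analyze API using the standard tokenizer.

    base_url is an assumed local, unsecured cluster address.
    """
    body = json.dumps({"tokenizer": "standard", "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def surviving_tokens(response_json):
    """Extract the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in response_json["tokens"]]


def dropped_characters(text, tokens):
    """Return non-whitespace characters of text that appear in no token."""
    kept = set("".join(tokens))
    return sorted({ch for ch in text if not ch.isspace() and ch not in kept})


def analyze(text, base_url="http://localhost:9200"):
    """Send text to a running cluster and report tokens and dropped characters."""
    with urllib.request.urlopen(build_analyze_request(text, base_url)) as resp:
        tokens = surviving_tokens(json.load(resp))
    return tokens, dropped_characters(text, tokens)
```

Calling `analyze("some text with - and & in it")` against a running cluster returns the tokens plus the characters that did not survive. Note that this comparison is per character, so a symbol that is kept inside one token but dropped elsewhere in the same input will be reported as kept; test such symbols in separate inputs.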

system · April 10, 2020, 3:38pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
