Standard tokenizer punctuation symbols removed

The ES documentation here [0] states this about the standard tokenizer:

The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.

Where can I find the definition of the exact set of punctuation symbols that get removed (e.g. hyphen, ampersand)?

Thanks,
Chris

[0] Tokenizer reference | Elasticsearch Guide [8.11] | Elastic

The word boundary rules for Unicode Text Segmentation are defined in Unicode Standard Annex #29 (UAX #29). HTH

Thanks for your response @munasia. My initial thought was that the definition of the punctuation that gets removed would be there, but I couldn't find it. Did you? Looking more closely at the text I quoted above, it seems to me that the word boundaries are defined by the Unicode Text Segmentation algorithm, but the punctuation removal is done separately from that. At least, the wording leaves it unclear whether the punctuation removal is part of the Unicode Text Segmentation algorithm or something separate. My guess is that it's separate, but I'm not sure, which is why I posted this question :)

The Unicode Text Segmentation document defines the word boundary rules. It won't contain a list of the punctuation symbols that Elasticsearch removes.

The standard tokenizer implements the word boundary rules as defined in that document. The actual punctuation removal is done in the Lucene code (the StandardTokenizer class), so that is where to look for the exact behavior.
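Since there is no published list, one practical way to see what happens to a given symbol is to probe the tokenizer directly with the `_analyze` API. Below is a minimal sketch, not an official tool. It assumes an unsecured node reachable at `http://localhost:9200`, and the probe strings and helper names (`build_analyze_body`, `analyze`, `probe_all`) are made up for illustration. It sends each probe through the standard tokenizer alone, with no token filters, so whatever changes is down to the tokenizer itself.

```python
import json
import urllib.request

# Assumption: an unsecured local node. Adjust the URL and add auth for your cluster.
ES_URL = "http://localhost:9200"

# Hypothetical probe list. Extend it with whatever symbols you care about.
PROBES = ["wi-fi", "AT&T", "a.b", "3.14", "user@example.com", "a_b", "it's", "#tag"]


def build_analyze_body(text):
    """Build an _analyze request body that uses only the standard tokenizer."""
    return {"tokenizer": "standard", "text": text}


def analyze(text, es_url=ES_URL):
    """POST the probe to /_analyze and return the emitted tokens as strings."""
    req = urllib.request.Request(
        f"{es_url}/_analyze",
        data=json.dumps(build_analyze_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


def probe_all(probes=PROBES, es_url=ES_URL):
    """Print each probe next to the tokens the standard tokenizer produced for it."""
    for probe in probes:
        print(f"{probe!r} -> {analyze(probe, es_url)}")
```

Calling `probe_all()` against a running cluster prints one line per probe. A symbol that is missing from every token on that line was dropped or treated as a boundary in that context. Note that the result can depend on the neighbouring characters, so it is worth testing each symbol between letters, between digits, and on its own.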
