The ES documentation here [0] states this about the standard tokenizer:
The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
Where can I find the definition of the exact set of punctuation symbols that get removed (e.g. hyphen, ampersand)?
Thanks for your response @munasia. My initial thought was that the definition for punctuation that gets removed would be there. But I couldn't find it there. Did you? And then upon looking closer at the text I quoted above, it seems to me that the word boundaries are defined by the Unicode Text Segmentation algorithm but the punctuation removal is done separate from that. At least it seems unclear from the wording whether the punctuation removal is part of the Unicode Text Segmentation algorithm or if it's something separate. My guess is that it's separate but I'm not sure, which is why I posted this question
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.