Standard tokenizer punctuation symbols removed

Chris_Gatihi · March 13, 2020, 3:36am

Thanks for your response, @munasia. My initial thought was that the definition of the punctuation that gets removed would be there, but I couldn't find it. Did you? Looking more closely at the text I quoted above, it seems to me that the word boundaries are defined by the Unicode Text Segmentation algorithm, but that the punctuation removal is done separately. At the very least, the wording leaves it unclear whether punctuation removal is part of the Unicode Text Segmentation algorithm or a separate step. My guess is that it's separate, but I'm not sure, which is why I posted this question.
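To make the "separate step" reading concrete, here is a minimal Python sketch of what I have in mind. It is only an illustration of the guess above, not Lucene's or Elasticsearch's implementation: the function names `segment`, `keep` and `tokenize` are hypothetical, and the regular expression is a crude stand-in for the Unicode Text Segmentation rules, which are considerably more detailed.

```python
import re
import unicodedata

# Hypothetical two-step sketch. This is NOT the standard tokenizer and NOT
# a full implementation of Unicode Text Segmentation (UAX #29).
# Crude segmentation: runs of word characters, runs of whitespace,
# or a single other character (punctuation or symbol).
_SEGMENT = re.compile(r"\w+|\s+|[^\w\s]")


def segment(text):
    """Step 1: split the text into segments at (approximate) boundaries."""
    return _SEGMENT.findall(text)


def keep(segment_text):
    """Step 2: keep a segment only if it contains a letter or a digit."""
    return any(
        unicodedata.category(ch)[0] in ("L", "N")  # Letter or Number
        for ch in segment_text
    )


def tokenize(text):
    """Segment first, then drop punctuation-only and whitespace-only segments."""
    return [s for s in segment(text) if keep(s)]


print(segment("Hello, world!"))   # ['Hello', ',', ' ', 'world', '!']
print(tokenize("Hello, world!"))  # ['Hello', 'world']
```

Under this reading, no explicit list of "removed punctuation" would exist anywhere: a symbol disappears simply because the segment it ends up in contains no letter or digit. Whether the real tokenizer works this way is exactly what I am asking.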
