Need help understanding how the standard tokenizer works

jsjs2 · January 19, 2022, 6:07pm

In the following two examples, I have a hard time understanding why the outputs are different:

POST _analyze
{
  "tokenizer": "standard",
  "text": "mark.cuban"
}

terms produced: ["mark.cuban"]

POST _analyze
{
  "tokenizer": "standard",
  "text": "mark1.cuban"
}

terms produced: ["mark1","cuban"]

Why is the "." treated as a token separator in the second example, but not in the first?

Tomo_M · January 20, 2022, 1:12am

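I'm not familiar with the tokenizer, but Table 3 (Word_Break Property Values) on the page linked below lists U+002E ( . ) FULL STOP as MidNumLet, not MidLetter. As far as I can tell from the word boundary rules there, a MidNumLet character does not cause a break when it sits between two letters (WB6/WB7) or between two digits (WB11/WB12). In "mark1.cuban" the "." sits between a digit and a letter, so neither pair of rules applies and it becomes a word boundary.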

https://unicode.org/reports/tr29/
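
To make the rules in that report concrete, here is a minimal Python sketch of the relevant word boundary rules (WB5 to WB12 and the default WB999). It is not how Lucene implements the standard tokenizer. It is a simplified model that only classifies ASCII letters, ASCII digits, and "." and treats everything else as "Other". The function names `word_break_class`, `is_boundary`, and `tokenize` are made up for this illustration.

```python
def word_break_class(ch):
    """Very reduced Word_Break classification (assumption: ASCII input only)."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    if ch == ".":
        return "MidNumLet"  # U+002E FULL STOP in UAX #29 Table 3
    return "Other"


def is_boundary(classes, i):
    """True if there is a word boundary between positions i-1 and i."""
    prev, cur = classes[i - 1], classes[i]
    before_prev = classes[i - 2] if i >= 2 else None
    nxt = classes[i + 1] if i + 1 < len(classes) else None

    # WB5, WB8, WB9, WB10: no break between adjacent letters and digits.
    if prev in ("ALetter", "Numeric") and cur in ("ALetter", "Numeric"):
        return False
    # WB6: ALetter × MidNumLet ALetter
    if prev == "ALetter" and cur == "MidNumLet" and nxt == "ALetter":
        return False
    # WB7: ALetter MidNumLet × ALetter
    if before_prev == "ALetter" and prev == "MidNumLet" and cur == "ALetter":
        return False
    # WB12: Numeric × MidNumLet Numeric
    if prev == "Numeric" and cur == "MidNumLet" and nxt == "Numeric":
        return False
    # WB11: Numeric MidNumLet × Numeric
    if before_prev == "Numeric" and prev == "MidNumLet" and cur == "Numeric":
        return False
    # WB999: otherwise, break everywhere.
    return True


def tokenize(text):
    """Split at word boundaries and keep segments that contain a letter or digit."""
    if not text:
        return []
    classes = [word_break_class(c) for c in text]
    tokens, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or is_boundary(classes, i):
            segment = text[start:i]
            # Segments made only of punctuation or spaces are dropped.
            if any(c.isalnum() for c in segment):
                tokens.append(segment)
            start = i
    return tokens


print(tokenize("mark.cuban"))   # ['mark.cuban']
print(tokenize("mark1.cuban"))  # ['mark1', 'cuban']
```

In the first input the "." has a letter on each side, so WB6 and WB7 keep the word together. In the second input the "." has a digit on its left and a letter on its right, so none of the four "Mid" rules match, the "." is split off as its own segment, and that segment is dropped because it contains no letter or digit.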

