Need help to understand how standard tokenizer works

In the following two examples, I had hard time to understand why the outputs are different:

POST _analyze
{
  "tokenizer": "standard",
  "text": "mark.cuban"
}

terms produced: ["mark.cuban"]

POST _analyze
{
  "tokenizer": "standard",
  "text": "mark1.cuban"
}

terms produced: ["mark1","cuban"]

why the "." is treated as tokenizer in the second example, but not in the first?

I'm not familiar with tokenizer, but Table 3. Word_Break Property Values in the latter page says, U+002E ( . ) FULL STOP is treated as a word boundary for MidNumLet but not for MidLetter. I have no idea about the reason.

https://unicode.org/reports/tr29/

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.