Inconsistent number of tokens with uax_url_email

I would like the first tokenization below (#1) to result in 1 token. The only difference from #2 is that the domain starts with a number. Any feedback is greatly appreciated. Thank you.

#1. Results in 2 tokens:
curl -XGET 'http://localhost:9200/_analyze?tokenizer=uax_url_email&text=www.123abc.com'

{"tokens":[{"token":"www","start_offset":0,"end_offset":3,"type":"","position":0},{"token":"123abc.com","start_offset":4,"end_offset":14,"type":"","position":1}]}

#2. Results in 1 token:
curl -XGET 'http://localhost:9200/_analyze?tokenizer=uax_url_email&text=www.a123bc.com'
{"tokens":[{"token":"www.a123bc.com","start_offset":0,"end_offset":14,"type":"","position":0}]}

Hey,

The way I understand this, you need to provide a full URL that includes a scheme such as http:// for the tokenizer's URL handling to kick in. You can easily reproduce this by omitting the tokenizer parameter: you get exactly the same results. The results only differ once a real URL (with a scheme) is used.
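If that explanation holds, prepending a scheme to the text from #1 should bring it back as a single token, while the default (standard) tokenizer should reproduce the two-token split. A quick way to check, assuming an Elasticsearch instance on localhost:9200 (note the scheme must be percent-encoded, since it sits inside a query string; the outputs below are not shown because they depend on your Elasticsearch version):

```shell
# Same input as #1, but with an explicit http:// scheme
# (http%3A%2F%2F is the percent-encoded form of "http://").
curl -XGET 'http://localhost:9200/_analyze?tokenizer=uax_url_email&text=http%3A%2F%2Fwww.123abc.com'

# Control: no tokenizer parameter, so the default analyzer is used.
# This should reproduce the two-token split seen in #1.
curl -XGET 'http://localhost:9200/_analyze?text=www.123abc.com'
```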

--Alex

Hi Alex,

Thank you for your answer. That helps us understand the tokenizer and find a solution, since our data/URLs do not have schemes.

Joey