Uax_url_email tokenizer unexpected result

I define a sub-field like so:

        "tokenizer": {
            "my_email_tokenizer": {
                "type": "uax_url_email",
                "max_token_length": 100
            }
        },
        ....
            "my_email_analyzer": {
                "type": "custom",
                "tokenizer": "my_email_tokenizer",
                "filter": ["lowercase", "stop","length_filter"]
            },
            ...
            "fields": {
                "emails":{
                   "type":"text",
                   "analyzer":"my_email_analyzer"
                },

However, when I try to analyze the email "foobar@baz.mail" against this field, the result is:

{'tokens': [{'end_offset': 13,
'position': 0,
'start_offset': 0,
'token': 'foobar@baz.ma',
'type': ''},
{'end_offset': 15,
'position': 1,
'start_offset': 13,
'token': 'il',
'type': ''}]}
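
For anyone who wants to reproduce this, here is a minimal sketch of the request against the index-level `_analyze` endpoint, using only the Python standard library. The base URL, the index name `my_index`, and the field path `my_field.emails` are placeholders (the parent field name is not shown above), so adjust them to your own mapping.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, field, text):
    """Build (but do not send) a POST request to <index>/_analyze."""
    # The "field" parameter asks Elasticsearch to use the analyzer
    # configured for that field in the index mapping.
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical names: replace the URL, index, and field path with your own.
req = build_analyze_request(
    "http://localhost:9200", "my_index", "my_field.emails", "foobar@baz.mail"
)

# To send it against a running cluster, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```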

Why is it splitting up the mail token? I thought it might be the max length, but I set it to 100 to be sure.

I am using ES 6.3.

The tokenizer only recognizes addresses that end in a TLD from a built-in list, not just anything of the form a@b.WHATEVER, and that list sometimes takes a while to be updated. I don't know off the top of my head whether .mail is a valid TLD. If it is, this may require an update in Lucene.

A quick Google search shows that .mail is not approved by ICANN. I do not know where Lucene gets its TLD data from, though.
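
One plausible reading of the output in the question: .ma is a valid country-code TLD (Morocco), so the longest prefix that ends in a known TLD is "foobar@baz.ma", and the leftover "il" becomes its own token. The toy sketch below illustrates that idea only. It is not Lucene's actual grammar, and `KNOWN_TLDS`, `LOCAL_AND_HOST`, and `split_email_like` are made-up names with a tiny hand-picked TLD sample.

```python
import re

# Tiny hand-picked sample for illustration; the real list is far longer.
KNOWN_TLDS = {"com", "org", "net", "ma"}

# Local part, "@", and one or more dot-terminated host labels.
LOCAL_AND_HOST = re.compile(r"^[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+")


def split_email_like(text):
    """Return (email, remainder) for the longest prefix ending in a known TLD.

    Returns (None, text) when no prefix ends in a known TLD.
    """
    match = LOCAL_AND_HOST.match(text)
    if not match:
        return None, text
    head = match.group(0)        # e.g. "foobar@baz."
    tail = text[len(head):]      # e.g. "mail"
    letters = re.match(r"[A-Za-z]*", tail).group(0)
    # Try the longest candidate TLD first, then shorter prefixes.
    for end in range(len(letters), 0, -1):
        if letters[:end].lower() in KNOWN_TLDS:
            return head + letters[:end], tail[end:]
    return None, text


print(split_email_like("foobar@baz.mail"))  # ('foobar@baz.ma', 'il')
print(split_email_like("foobar@baz.com"))   # ('foobar@baz.com', '')
```

If this reading is right, a domain whose last label does not start with any known TLD would not be recognized as an email at all, rather than being split in the middle.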

Hope this helps!

Does ICANN disallow local domains in email addresses?
