Uax_url_email tokenizer unexpected result

doaks · March 17, 2019, 11:20pm

I define a sub-field like so:

        "tokenizer": {
            "my_email_tokenizer": {
                "type": "uax_url_email",
                "max_token_length": 100
            }
        },
        ....
            "my_email_analyzer": {
                "type": "custom",
                "tokenizer": "my_email_tokenizer",
                "filter": ["lowercase", "stop","length_filter"]
            },
            ...
            "fields": {
                "emails":{
                   "type":"text",
                   "analyzer":"my_email_analyzer"
                },

However, when I try to analyze the email "foobar@baz.mail" against this field, the result is:

{'tokens': [{'end_offset': 13,
'position': 0,
'start_offset': 0,
'token': 'foobar@baz.ma',
'type': ''},
{'end_offset': 15,
'position': 1,
'start_offset': 13,
'token': 'il',
'type': ''}]}

Why is it splitting up the mail token? I thought it might be the max length, but I set it to 100 to be sure.

I am using ES 6.3.
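One way to narrow this down is to run the same text through the bare `uax_url_email` tokenizer, without the custom analyzer's filters. The sketch below builds an `_analyze` request body for that; the `base_url` default and both helper names are assumptions for illustration, not something taken from the setup above.

```python
import json
import urllib.request


def build_analyze_request(text, tokenizer="uax_url_email"):
    """Return the JSON body for an _analyze call that uses a built-in tokenizer."""
    return {"tokenizer": tokenizer, "text": text}


def analyze(text, base_url="http://localhost:9200"):
    """POST the body to the _analyze endpoint (base_url is an assumed local node)."""
    body = json.dumps(build_analyze_request(text)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If the split still shows up with the tokenizer alone, the `lowercase`, `stop`, and `length_filter` filters are not the cause.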

spinscale · March 20, 2019, 8:46am

The tokenizer matches against a list of known TLDs rather than accepting anything of the form a@b.WHATEVER, and that list sometimes takes a while to be updated. I don't know off the top of my head whether .mail is a valid TLD; if it is, this may require an update in Lucene.

A quick Google search shows that .mail is not approved by ICANN. I do not know where Lucene gets its TLD data from, though.

Hope this helps!
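A toy sketch of why the token ends at "foobar@baz.ma": .ma is a registered country-code TLD (Morocco), so the longest stretch of the input that still ends in a known TLD stops there, and "il" is left over. This is only an illustration of that idea, not Lucene's actual grammar; `KNOWN_TLDS` and `split_email_like` are hypothetical names, and the TLD set is a tiny hand-picked subset.

```python
# Toy illustration only: NOT Lucene's implementation.
KNOWN_TLDS = {"com", "org", "ma"}  # hypothetical, hand-picked subset of real TLDs


def split_email_like(text, tlds=KNOWN_TLDS):
    """Split text into the longest email ending in a known TLD plus any leftover."""
    local, at, domain = text.partition("@")
    if not at or "." not in domain:
        return [text]  # nothing that looks like local@host.tld
    last = domain.rpartition(".")[2]  # final label, e.g. "mail"
    # Try the longest prefix of the final label first, then shorter ones.
    for end in range(len(last), 0, -1):
        if last[:end].lower() in tlds:
            cut = len(text) - (len(last) - end)
            email, rest = text[:cut], text[cut:]
            return [email] + ([rest] if rest else [])
    return [text]


print(split_email_like("foobar@baz.mail"))  # ['foobar@baz.ma', 'il']
```

The split point at offset 13 matches the `end_offset` of the first token in the output posted above.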

doaks · March 20, 2019, 9:15pm

Does ICANN disallow local domains in email addresses?

system · April 17, 2019, 9:15pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
