I'm using Elasticsearch 7.9 with the following analysis settings:
"filter": {
  "email_filter": {
    "type": "pattern_capture",
    "preserve_original": "true",
    "patterns": [
      "([^@]+)",
      "(\\p{L}+)",
      "(\\d+)",
      "@(.+)"
    ]
  }
},
"analyzer": {
  "email_analyzer": {
    "filter": [
      "email_filter",
      "lowercase"
    ],
    "tokenizer": "uax_url_email"
  }
}
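To make clear what I expect the filter to do, here is a rough Python approximation of pattern_capture applied to a single token. It is only a sketch and not the real implementation: the helper name `pattern_capture` is made up for illustration, Python's `re` has no `\p{L}` so `[^\W\d_]+` stands in for it, duplicates are dropped, and the emitted order may differ from what the real filter produces. The lowercase filter is left out.

```python
import re

# Patterns from the "email_filter" settings. Python's re module has no \p{L},
# so [^\W\d_]+ stands in for it (an approximation of "one or more letters").
PATTERNS = [
    r"([^@]+)",
    r"([^\W\d_]+)",
    r"(\d+)",
    r"@(.+)",
]


def pattern_capture(token, patterns=PATTERNS, preserve_original=True):
    """Approximate pattern_capture on one token, de-duplicating the output."""
    emitted = [token] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            for group in match.groups():
                if group and group not in emitted:
                    emitted.append(group)
    return emitted


# Case 1: the tokenizer hands the filter the whole address as one token.
whole = pattern_capture("myemail@gmail.com")

# Case 2: the tokenizer has already split the text at "@", so the filter
# only ever sees the two pieces and cannot restore the original address.
split = [t for tok in ("myemail", "gmail") for t in pattern_capture(tok)]

print(whole)
print(split)
```

The point of the sketch is that the filter works per token, so the original address can only be preserved if the tokenizer emits it as a single token in the first place.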
When I run an _analyze request:
{
  "analyzer": "email_analyzer",
  "text": "myemail@gmail"
}
I get the following tokens:
{
  "tokens": [
    {
      "token": "myemail",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gmail",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Note that the text is not recognised as EMAIL: both tokens have type ALPHANUM, and I don't get the original token "myemail@gmail".
But when I run an _analyze request on an email address with a dot in the domain:
{
  "analyzer": "email_analyzer",
  "text": "myemail@gmail.com"
}
I get the expected response: the text is recognised as EMAIL, and "myemail@gmail.com" is also returned as the original token.
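For anyone who wants to reproduce both cases, the two requests can be scripted along these lines. This is a minimal sketch using only the Python standard library. The host `http://localhost:9200` and the index name `my_index` are placeholders and not my real setup, and the helper names `build_analyze_request` and `token_types` are made up for illustration.

```python
import json
import urllib.request

# Placeholders for illustration only; adjust to the real cluster and index.
BASE_URL = "http://localhost:9200"
INDEX = "my_index"


def build_analyze_request(text, analyzer="email_analyzer"):
    """Build (but do not send) a POST /<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{INDEX}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_types(text):
    """Send the request and return (token, type) pairs from the response."""
    with urllib.request.urlopen(build_analyze_request(text)) as resp:
        tokens = json.load(resp)["tokens"]
    return [(t["token"], t["type"]) for t in tokens]
```

Calling `token_types("myemail@gmail")` and `token_types("myemail@gmail.com")` against an index that has the settings above should show the difference in token types side by side.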
Why is an email address without a dot in the domain, which is valid, not recognised as EMAIL by the uax_url_email tokenizer?