I've figured out what is happening here.
The uax_url_email tokenizer treats an email address as a single token, so
what is implied in this forum post
http://elasticsearch-users.115913.n3.nabble.com/Search-Email-Part-td2835831.html,
i.e. that uax_url_email splits the email into parts, is not correct. The
explanation here is more accurate:
http://elasticsearch-users.115913.n3.nabble.com/Problem-with-searching-Emails-td2902363.html#a2902602.
If you want to tokenize the email on "@" and "." then a pattern
tokenizer can be used.
This is an example analysis configuration that has both:
index:
  mapper:
    dynamic: false
  analysis:
    analyzer:
      default:
        type: standard
        stopwords: none
      # sortable is used to enable case-independent sorting
      sortable:
        tokenizer: keyword
        filter: [lowercase]
      # uax_url_email is used to analyse emails as a single token.
      uax_url_email:
        tokenizer: uax_url_email
        filter: [standard, lowercase, stop]
      # email_tokenizer is designed to tokenize emails on "." and "@"
      email_tokenizer:
        type: pattern
        pattern: "[\\.@]"
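To make the difference between the two analyzers concrete, here is a rough Python sketch of the tokens each one would be expected to produce. It only imitates the behaviour described above with a regular expression and does not call Elasticsearch; the function names and the sample address are made up for illustration.

```python
import re


def single_token(text: str) -> list[str]:
    """Imitates uax_url_email + lowercase for one address: a single token."""
    return [text.lower()]


def pattern_tokens(text: str, pattern: str = r"[.@]") -> list[str]:
    """Imitates a pattern analyzer: split on the regex, lowercase, drop empties."""
    return [part.lower() for part in re.split(pattern, text) if part]


email = "Test.User@gmail.com"  # illustrative address, not from a real index
print(single_token(email))     # the whole address stays together
print(pattern_tokens(email))   # the address is split on "." and "@"
```

The first call keeps the whole address as one token, while the second splits it into its parts, which is why a term such as "gmail" can only match directly against the pattern-based field.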
Interestingly enough, it's possible to get some quite nice partial
matches using the uax_url_email tokenizer by creating a String query
that allows leading wildcards. For an email
"test.user@gmail.com", the following would all match when wrapped in
leading and trailing wildcards (NB. use of leading wildcards is
resource intensive and can affect performance if used across more
than one or two fields):
"user@gmail", "test.", "@gmail", "er@gm" etc.
-David.
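As a footnote to the partial matches mentioned above, this is a small Python sketch of why they work against a single-token field but not against a field split on "." and "@". It uses the standard library's fnmatch as a stand-in for wildcard matching and does not talk to Elasticsearch. The helper name, the field name "email" and the query body at the end are assumptions for illustration, not checked against a particular Elasticsearch version.

```python
from fnmatch import fnmatchcase


def wildcard_hits(fragment: str, tokens: list[str]) -> bool:
    """True if the fragment, wrapped in leading and trailing wildcards, matches any token."""
    pattern = f"*{fragment.lower()}*"
    return any(fnmatchcase(token, pattern) for token in tokens)


single = ["test.user@gmail.com"]          # uax_url_email style: one token
split = ["test", "user", "gmail", "com"]  # pattern style: split on "." and "@"

for fragment in ["user@gmail", "test.", "@gmail", "er@gm"]:
    print(fragment, wildcard_hits(fragment, single), wildcard_hits(fragment, split))

# Hypothetical query body for the single-token field; "email" is an assumed field name.
query = {
    "query_string": {
        "default_field": "email",
        "query": "*@gmail*",
        "allow_leading_wildcard": True,
    }
}
```

Every fragment matches the single token, and none of them matches the split tokens, because "@" and "." never survive the split.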
On Jan 10, 9:46 am, davrob2 davirobe...@gmail.com wrote:
Bump on this, I'm pretty sure that email tokenizing has been sorted
since 16.0 (http://elasticsearch-users.115913.n3.nabble.com/Search-Email-Part-tp2...
) but for some reason I can't get it working.
On Jan 9, 4:33 pm, davrob2 davirobe...@gmail.com wrote:
Hi,
I've defined an analyser and used it in a Mapping as defined below:
Email Tokenizer Problem · GitHub
But when I enter something like @gmail.com, the wildcard search does
not match; it only matches when I enter "test.u...@gmail.com" or
"test.user" or "test." etc., i.e. there appears to be no tokenization
on the "@" or the ".".
This is the query I'm using:
Email Tokenizer Problem · GitHub
Regards,
David.