Email Tokenizing Not Working?


(davrob) #1

Hi,

I've defined an analyser and used it in a mapping, as defined below:

https://gist.github.com/1583636

But when I enter something like @gmail.com, it does not match the
wildcard search. It only matches when I enter "test.user@gmail.com",
"test.user", "test." etc., i.e. there appears to be no tokenization
on the "@" or the ".".

This is the query I'm using:

https://gist.github.com/1583708

Regards,

David.


(davrob) #2

Bump on this. I'm pretty sure that email tokenizing has been sorted
since 0.16.0
(http://elasticsearch-users.115913.n3.nabble.com/Search-Email-Part-tp2835831p2835885.html)
but for some reason I can't get it working.

On Jan 9, 4:33 pm, davrob2 davirobe...@gmail.com wrote:

Hi,

I've defined an analyser and used it in a Mapping as defined below:

https://gist.github.com/1583636

But when I enter something like @gmail.com, it is not matching the
wildcard search, it only matches when I enter "test.u...@gmail.com" or
"test.user" or "test." etc. i.e. there appears to be no tokenization
on the "@" or the ".".

This is the query I'm using:

https://gist.github.com/1583708

Regards,

David.


(davrob) #3

I've figured out what is happening here.

The tokenizer uax_url_email treats the email as a single token, so
what is implied in this forum post
(http://elasticsearch-users.115913.n3.nabble.com/Search-Email-Part-td2835831.html),
i.e. that uax_url_email splits the email into separate tokens, is not
correct. The explanation here is more accurate:
http://elasticsearch-users.115913.n3.nabble.com/Problem-with-searching-Emails-td2902363.html#a2902602

If you want to tokenize the email on "@" and ".", then a pattern
tokenizer can be used.

This is an example mapping that has both:

    index:
      mapper:
        dynamic: false
      analysis:
        analyzer:
          default:
            type: standard
            stopwords: none

          # sortable is used to enable case-independent sorting
          sortable:
            tokenizer: keyword
            filter: [lowercase]

          # uax_url_email is used to analyse emails as a single token
          uax_url_email:
            tokenizer: uax_url_email
            filter: [standard, lowercase, stop]

          # email_tokenizer is designed to tokenize emails on "." and "@"
          email_tokenizer:
            type: pattern
            pattern: "[\\.@]"
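
As a rough illustration of what splitting on "." and "@" produces, here is a minimal Python sketch that applies the same character class with the standard `re` module. It only emulates the splitting step outside Elasticsearch and ignores anything else the analyzer may do (such as lowercasing); the helper name `split_email` is made up for this example.

```python
import re

# Same character class as the `pattern` setting above: split on "." or "@".
EMAIL_SPLIT = re.compile(r"[\.@]")


def split_email(text):
    """Split text on '.' and '@', dropping empty pieces."""
    return [part for part in EMAIL_SPLIT.split(text) if part]


if __name__ == "__main__":
    print(split_email("test.user@gmail.com"))
```

With tokens like these in the index, a search for "gmail" or "user" can match a whole token without needing any wildcard.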

Interestingly enough, it's possible to get some quite nice partial
matches using the uax_url_email tokenizer by creating a string query
that allows leading wildcards. For an email "test.user@gmail.com",
the following would all match when using leading and trailing
wildcards (NB: leading wildcards are resource intensive and can
affect performance if used across more than one or two fields):

"user@gmail", "test.", "@gmail", "er@gm" etc.

-David.
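
The partial matches described above can be sketched in a few lines of Python. This is only an emulation under stated assumptions: `fnmatchcase` from the standard library stands in for the wildcard matching, the address is treated as one indexed token (as uax_url_email produces), and the helper `matches` is hypothetical. It is not how Elasticsearch executes the query.

```python
from fnmatch import fnmatchcase

# With uax_url_email the whole address is indexed as one token.
TOKEN = "test.user@gmail.com"


def matches(fragment, leading=True, trailing=True):
    """Wrap the fragment in wildcards and test it against the single token."""
    pattern = ("*" if leading else "") + fragment + ("*" if trailing else "")
    return fnmatchcase(TOKEN, pattern)


if __name__ == "__main__":
    for fragment in ["user@gmail", "test.", "@gmail", "er@gm"]:
        print(fragment, matches(fragment))
    # Without a leading wildcard only prefixes of the token can match.
    print("@gmail.com", matches("@gmail.com", leading=False))
```

This also mirrors the original symptom: with only a trailing wildcard, "test." matches because it is a prefix of the token, while "@gmail.com" does not.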


(Clinton Gormley) #4

The tokenizer uax_url_email treats the email as a single token, so
what is implied in this forum post

Ooh - interesting. I assumed that this sentence:

    A tokenizer of type uax_url_email which works exactly like the
    standard tokenizer, but also handles emails and urls.

...meant that it tokenized emails and urls into individual parts, not as
single tokens.

Thanks for the post. I've updated the docs to reflect that:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/uaxurlemail-tokenizer.html

clint


(Ask Bjørn Hansen) #5

On Jan 11, 3:43 am, Clinton Gormley cl...@traveljury.com wrote:

The tokenizer uax_url_email treats the email as a single token, so
what is implied in this forum post

Ooh - interesting. I assumed that this sentence:

    A tokenizer of type uax_url_email which works exactly like the
    standard tokenizer, but also handles emails and urls.

...meant that it tokenized emails and urls into individual parts, not as
single tokens.

_analyze seems to show something else, or am I calling it wrong?

$ curl -XGET 'indexdev1.la.sol:9200/us-devel-rms-v1/_analyze?tokenizer=uax_url_email&format=text' -d 'some.email@DOMAIN.COM'; echo
{"tokens":"[some.email:0->10:]\n\n2: \n[domain.com:11->21:]\n"}

Ask

--
http://askask.com/


(Clinton Gormley) #6

_analyze seems to show something else, or am I calling it wrong?

$ curl -XGET 'indexdev1.la.sol:9200/us-devel-rms-v1/_analyze?tokenizer=uax_url_email&format=text' -d 'some.email@DOMAIN.COM'; echo
{"tokens":"[some.email:0->10:]\n\n2: \n[domain.com:11->21:]\n"}

What version of ES are you running? The form that allows you to pass
'tokenizer' params is only available as of 0.19.0RC1.

Otherwise you're actually just using the 'standard' analyzer.
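
For reference, the request form discussed here can be assembled as in the short Python sketch below. The `tokenizer` and `format` query parameters are taken from the curl command quoted above; the host `localhost:9200`, the index name `my-index`, and the helper `analyze_url` are placeholders invented for the example. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def analyze_url(host, index, tokenizer, fmt="text"):
    """Build an _analyze URL that names the tokenizer explicitly."""
    query = urlencode({"tokenizer": tokenizer, "format": fmt})
    return "http://{}/{}/_analyze?{}".format(host, index, query)


if __name__ == "__main__":
    # "localhost:9200" and "my-index" are placeholder values.
    print(analyze_url("localhost:9200", "my-index", "uax_url_email"))
```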

c

