Tokenize email address


(Ask Bjørn Hansen) #1

Hi everyone,

We're indexing a user database for an admin interface and would like to
search on email addresses. It works fine except for email addresses with
dots in them, where our users expect to be able to search on each part
("some.user@domain" should match "some", "user" or "domain"). Anyway, we
can't get any of the built-in tokenizers to split on the "." or on "_",
for that matter. The standard tokenizer works as expected on "-", but most
email addresses have dots. :slight_smile:

Any tips? What am I doing wrong?

Ask

$ curl 'http://indexdev1.la.sol:9200/us-devel-rms-v1/_analyze?pretty=1&tokenizer=uax_url_email' -d 'some.user@domain'
{
  "tokens" : [ {
    "token" : "some.user",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "",
    "position" : 1
  }, {
    "token" : "domain",
    "start_offset" : 10,
    "end_offset" : 16,
    "type" : "",
    "position" : 2
  } ]
}


(Clinton Gormley) #2

Hi Ask

We're indexing a user database for an admin interface and would like
to search on email addresses. It works fine except for email
addresses with dots in them, where our users expect to be able to
search on each part ("some.user@domain" should match "some", "user"
or "domain"). Anyway, we can't get any of the built-in tokenizers to
split on the "." or on "_", for that matter. The standard tokenizer
works as expected on "-", but most email addresses have dots.
:slight_smile:

If your field contains just email addresses, then use the 'simple'
analyzer.
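
The 'simple' analyzer lowercases and splits on anything that isn't a
letter, so '.', '_' and '@' all become token boundaries. You can try it
against the same _analyze endpoint as in your example:

$ curl 'http://indexdev1.la.sol:9200/us-devel-rms-v1/_analyze?pretty=1&analyzer=simple' -d 'some.user@domain'

which should come back with the three tokens 'some', 'user' and
'domain'.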

If your email addresses are embedded in other text, then you may want to
do something smarter with the (undocumented) pattern replace filter:
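
Or, more simply, use the 'pattern' analyzer, which splits on a regex.
Something like this (untested, and the 'email_parts' name is just an
example) would split on every run of non-alphanumerics and lowercase
the result:

    "analysis" : {
        "analyzer" : {
            "email_parts" : {
                "type"      : "pattern",
                "pattern"   : "[^a-zA-Z0-9]+",
                "lowercase" : true
            }
        }
    }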

clint


(Ask Bjørn Hansen) #3

On Feb 10, 3:59 am, Clinton Gormley cl...@traveljury.com wrote:

If your field contains just email addresses, then use the 'simple'
analyzer.

Brilliant, thank you! This works perfectly.

I have a variation of this now - we have another query where we need
to make sure that the email address matches exactly, except for case
variations. I'm indexing it with the email mapping below now, but
that requires me to lowercase the input (and thus it gets messed up
for display in _source).

I have another field too (an ID type field) where I need to be able to
filter for it case-insensitively, but keep it in the _source as the
original. There I have two fields in my mapping, like:

                provider_uid    => {type => "string", index => "not_analyzed"},
                provider_uid_lc => {type => "string", index => "not_analyzed"},

The _lc version is excluded from _source and I lowercase it when
indexing, and then use provider_uid for display etc in my
application. Is there a better way? It seems like the multi_field is
what I want, but I couldn't figure out how to lowercase, but not
tokenize.

Here's the email mapping:

                email => {
                    type   => "multi_field",
                    fields => {
                        "email" => {
                            type     => "string",
                            index    => "analyzed",
                            analyzer => "simple",
                        },
                        "raw" => {
                            type  => "string",
                            index => "not_analyzed",
                        },
                    }
                },

Thanks again! The excellent software wouldn't be as good as it is
without the great community helping the lost and clueless such as
myself. :slight_smile:

Ask


(Clinton Gormley) #4

I have a variation of this now - we have another query where we need
to make sure that the email address matches exactly, except for case
variations. I'm indexing it with the email mapping below now, but
that requires me to lowercase the input (and thus it gets messed up
for display in _source).

The _lc version is excluded from _source and I lowercase it when
indexing, and then use provider_uid for display etc in my
application. Is there a better way? It seems like the multi_field is
what I want, but I couldn't figure out how to lowercase, but not
tokenize.

Yes, a multi-field is the way to go. You need a custom analyzer which
uses the 'keyword' tokenizer and the 'lowercase' filter:
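
Something along these lines in your index settings (the
'lowercase_keyword' name is just an example):

    "analysis" : {
        "analyzer" : {
            "lowercase_keyword" : {
                "type"      : "custom",
                "tokenizer" : "keyword",
                "filter"    : ["lowercase"]
            }
        }
    }

Then map the exact-match sub-field of your multi_field with
index => "analyzed" and analyzer => "lowercase_keyword", and query it
with lowercased terms. _source is left untouched, so the original
casing survives for display.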

clint


(Ask Bjørn Hansen) #5

On Feb 14, 1:40 am, Clinton Gormley cl...@traveljury.com wrote:

Yes, a multi-field is the way to go. You need a custom analyzer which
uses the 'keyword' tokenizer and the 'lowercase' filter:

https://gist.github.com/1825357

Brilliant, thank you! I completely missed that the 'keyword' analyzer
is the 'null tokenizer' I couldn't find.

Ask

