Tokenize email address


(Ask Bjørn Hansen) #1

Hi everyone,

We're indexing a user database for an admin interface and would like to
search on email addresses. It works fine except for email addresses with
dots in them, where our users expect to be able to search on each part
("some.user@domain" should match "some", "user" or "domain"). Anyway, we
can't get any of the built-in tokenizers to split on the "." or on "_",
for that matter. The standard tokenizer works as expected on "-", but most
email addresses have dots. :slight_smile:

Any tips? What am I doing wrong?

Ask

$ curl 'http://indexdev1.la.sol:9200/us-devel-rms-v1/_analyze?pretty=1&tokenizer=uax_url_email' -d 'some.user@domain'
{
  "tokens" : [ {
    "token" : "some.user",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "",
    "position" : 1
  }, {
    "token" : "domain",
    "start_offset" : 10,
    "end_offset" : 16,
    "type" : "",
    "position" : 2
  } ]
}


(Clinton Gormley) #2

Hi Ask

We're indexing a user database for an admin interface and would like
to search on email addresses. It works fine except for email
addresses with dots in them, where our users expect to be able to
search on each part ("some.user@domain" should match "some", "user"
or "domain"). Anyway, we can't get any of the built-in tokenizers to
split on the "." or on "_", for that matter. The standard tokenizer
works as expected on "-", but most email addresses have dots.
:slight_smile:

If your field contains just email addresses, then use the 'simple'
analyzer.
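
The 'simple' analyzer lowercases and splits on anything that isn't a
letter, so '.', '_' and '@' all become token boundaries. You can try it
against the same _analyze endpoint as in your example:

$ curl 'http://indexdev1.la.sol:9200/us-devel-rms-v1/_analyze?pretty=1&analyzer=simple' -d 'some.user@domain'

which should come back with the three tokens 'some', 'user' and
'domain'.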

If your email addresses are embedded in other text, then you may want to
do something smarter with the (undocumented) pattern replace filter:
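
Or, more simply, use the 'pattern' analyzer, which splits on a regex.
Something like this (untested, and the 'email_parts' name is just an
example) would split on every run of non-alphanumerics and lowercase
the result:

    "analysis" : {
        "analyzer" : {
            "email_parts" : {
                "type"      : "pattern",
                "pattern"   : "[^a-zA-Z0-9]+",
                "lowercase" : true
            }
        }
    }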

clint


(Ask Bjørn Hansen) #3

On Feb 10, 3:59 am, Clinton Gormley cl...@traveljury.com wrote:

If your field contains just email addresses, then use the 'simple'
analyzer.

Brilliant, thank you! This works perfectly.

I have a variation of this now - we have another query where we need
to make sure that the email address matches exactly, except for case
variations. I'm indexing it with the email mapping below now, but
that requires me to lowercase the input (and thus it gets messed up
for display in _source).

I have another field too (an ID type field) where I need to be able to
filter for it case-insensitively, but keep it in the _source as the
original. There I have two fields in my mapping, like:

                provider_uid    => {type => "string", index => "not_analyzed"},
                provider_uid_lc => {type => "string", index => "not_analyzed"},

The _lc version is excluded from _source and I lowercase it when
indexing, and then use provider_uid for display etc in my
application. Is there a better way? It seems like the multi_field is
what I want, but I couldn't figure out how to lowercase, but not
tokenize.

Here's the email mapping:

                email => {
                    type   => "multi_field",
                    fields => {
                        "email" => {
                            type     => "string",
                            index    => "analyzed",
                            analyzer => "simple",
                        },
                        "raw" => {
                            type  => "string",
                            index => "not_analyzed",
                        },
                    }
                },

Thanks again! The excellent software wouldn't be as good as it is
without the great community helping the lost and clueless such as
myself. :slight_smile:

Ask


(Clinton Gormley) #4

I have a variation of this now - we have another query where we need
to make sure that the email address matches exactly, except for case
variations. I'm indexing it with the email mapping below now, but
that requires me to lowercase the input (and thus it gets messed up
for display in _source).

The _lc version is excluded from _source and I lowercase it when
indexing, and then use provider_uid for display etc in my
application. Is there a better way? It seems like the multi_field is
what I want, but I couldn't figure out how to lowercase, but not
tokenize.

Yes, a multi-field is the way to go. You need a custom analyzer which
uses the 'keyword' tokenizer and the 'lowercase' filter:
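
Something along these lines in your index settings (the
'lowercase_keyword' name is just an example):

    "analysis" : {
        "analyzer" : {
            "lowercase_keyword" : {
                "type"      : "custom",
                "tokenizer" : "keyword",
                "filter"    : ["lowercase"]
            }
        }
    }

Then map the exact-match sub-field of your multi_field with
index => "analyzed" and analyzer => "lowercase_keyword", and query it
with lowercased terms. _source is left untouched, so the original
casing survives for display.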

clint


(Ask Bjørn Hansen) #5

On Feb 14, 1:40 am, Clinton Gormley cl...@traveljury.com wrote:

Yes, a multi-field is the way to go. You need a custom analyzer which
uses the 'keyword' tokenizer and the 'lowercase' filter:

https://gist.github.com/1825357

Brilliant, thank you! I completely missed that the 'keyword' analyzer
is the 'null tokenizer' I couldn't find.

Ask

