Emails indexed in uppercase won't match lowercase searches


(Ron) #1

So I created this analyzer:

"uax_url_email" : {
     "filters" : [
        "standard",
        "lowercase",
        "stop"
     ],
     "type": "custom",
     "tokenizer" : "uax_url_email"
  }

And my mapping for the given field uses it:

        "email": {
          "type": "string",
          "analyzer": "uax_url_email"
        }

When my documents are indexed, most of them actually come in with the email value being uppercased. I didn't think this should really be an issue because when I search I can lowercase the term(s).

The problem is that if I do a search, it will only match if the term is cased as it was in the document - i.e. if the document had an uppercased email, it will only match if I provide the uppercased email in the search:

POST candidate/main/_search
{
  "query": {
    "term" : {
      "email" : "MATCHES@GMAIL.COM"
    }
  }
}

The above would return a value because it's uppercased, but if I issue the same search as 'matches@gmail.com', nothing is returned.

Am I doing something wrong? The lowercase filter from the analyzer means the indexed terms should all be lowercased but tokenized as emails (per the uax tokenizer), but the searches seem to require the same case as the original document.

I have some good experience with Lucene but this is my first foray into Elasticsearch. I'm using 1.6.0.


(David Pilato) #2

A first guess is that your mapping has not been applied.

Maybe you could reproduce it with a script so we can tell what's wrong...


(Ron) #3

This is a brand new index. I have deleted/recreated the index a few times, verifying the mappings & settings, and repopulating with my test records.

The sequence I'm running is right here:

DELETE my_index
PUT my_index
POST my_index/_close
PUT my_index/_settings
{
  "analysis" : {
    "analyzer" : {
      "email_analyzer" : {
         "filters" : [
            "lowercase"
         ],
         "type": "custom",
         "tokenizer" : "uax_url_email"
      }
    }
  }
}
POST my_index/_open
PUT my_index/main/_mapping
{
  "properties": {
    "email": {
      "type": "string",
      "analyzer": "email_analyzer"
    }
  }
}

It's pretty confusing :frowning:

Edit: since my first post I removed the standard and stop filters. That didn't help, but it's why the analyzer looks a bit different from my earlier post.


(David Pilato) #4

Then? How do you index?

Note: don't close/open the index. The first PUT index can contain your settings.


(Ron) #5

I have a Java application populating the index with some documents that include email addresses. The emails that come in are uppercase.

The analyzer is supposed to standardize the indexed terms, correct? My expectation was that, with the lowercase filter in my email_analyzer, the searchable terms would be lowercase.

So no matter what, I'd expect to search using lowercase values, possibly with an analyzer specified in the querybuilder object.

But when I do a search, it only returns results if I search with uppercase. That is to say, the search on the email address seems to be case sensitive.


(David Pilato) #6

You could use the _analyze API to check which tokens are produced by the analyzer.
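
On the 1.x releases used in this thread, that check can be written with query-string parameters. A sketch, assuming the my_index and email_analyzer names from post #3 and an uppercase sample address (newer major versions take the analyzer and text in a JSON request body instead):

```
GET my_index/_analyze?analyzer=email_analyzer&text=TEST@EMAIL.COM
```

If the lowercase filter is being applied, the returned token should be lowercase. An uppercase token means the filter never ran.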


(Ron) #7

It comes out like this:

{
  "tokens" : [ {
    "token" : "TEST@EMAIL.COM",
    "start_offset" : 0,
    "end_offset" : 20,
    "type" : "<EMAIL>",
    "position" : 1
  } ]
}

I'm not really sure what I'm looking at.

This is a full test case:

PUT my_index
{
  "analysis" : {
    "analyzer" : {
      "email_analyzer" : {
         "filters" : [
            "lowercase"
         ],
         "type": "custom",
         "tokenizer" : "uax_url_email"
      }
    }
  }
}

PUT my_index/main/_mapping
{
  "properties": {
    "email": {
      "type": "string",
      "analyzer": "email_analyzer"
    }
  }
}

PUT my_index/main/1
{
  "email" : "TEST@EMAIL.COM"
}

/* this returns the document */
GET my_index/main/_search 
{
  "query": {
    "match": {
      "email": "TEST@EMAIL.COM"
    }
  }
}

/* this does NOT return the document */
GET my_index/main/_search 
{
  "query": {
    "match": {
      "email": "test@email.com"
    }
  }
}

Just trying to figure out what, if anything, I'm doing wrong.


(David Pilato) #8

Got it!

Replace "filters" with "filter" :smile:

PUT my_index
{
  "analysis" : {
    "analyzer" : {
      "email_analyzer" : {
        "filter" : [
          "lowercase"
        ],
        "type": "custom",
        "tokenizer" : "uax_url_email"
      }
    }
  }
}
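
Since the create-index request can carry the settings, it can carry the mapping as well, which avoids the separate _mapping call. A sketch for the 1.x series, assuming the my_index, main, and email_analyzer names from the test case and wrapping the analysis block in an explicit "settings" key:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "main": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email_analyzer"
        }
      }
    }
  }
}
```

Documents indexed before the change keep their uppercase terms, so they need to be indexed again after the index is recreated.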


(Ron) #9

:facepalm:

thanks!


(Ron) #10

It did work the way I expected after that change. Elastic has been real good about letting me know about malformed requests, so I didn't think to look for a typo.
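
One caveat for the term query from the first post: term queries are not analyzed, so the search value is compared with the indexed token as-is. With the lowercase filter applied at index time, a sketch of that search (same candidate/main index and sample address as in post #1) would pass the lowercase form:

```
POST candidate/main/_search
{
  "query": {
    "term": {
      "email": "matches@gmail.com"
    }
  }
}
```

A match query, as used in the full test case, runs the search text through the field's analyzer, so it should find the document whichever case is typed.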

