Like query *string*

Hi guys,

Specification:
Elasticsearch version : 6.2.2

I have data with an "emailaddress" field. Suppose one of the values is "user@example.com". The search input might be "@exam", "user@", "user", ".com", etc.

I have tried to achieve this using query_string with a ** pattern match. I got the appropriate results, but it seems slow. I also cannot use the uax_url_email tokenizer because the user is free to give any input. I have used the default standard tokenizer.
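For reference, the query I was running looks roughly like this (the index name and the exact wildcard term are placeholders):

```json
GET /myindex/_search
{
  "query": {
    "query_string": {
      "default_field": "emailaddress",
      "query": "*@exam*",
      "analyze_wildcard": true
    }
  }
}
```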

With this, suppose I have two email addresses, "user@example.com" and "example@user.com". If I search for "user@", both results are returned, which is not what I want.

Is there a specific tokenizer or method by which I can achieve this? It should behave like MySQL's LIKE query with %@user%.

After this I used edge_ngram, which gives me partially better output. Below is my analyzer:

"analysis":{
  "filter":{
	"emailtoken":{
	  "type":"edge_ngram",
	  "min_gram":1,
	  "max_gram":255,
	  "token_chars":[
	    "letter",
	    "digit",
	    "symbol",
	    "punctuation"
	  ]
	}
  },
  "analyzer":{
	"email":{
	  "type":"custom",
	  "tokenizer":"standard",
	  "filter":[
	    "lowercase",
	    "emailtoken"
	  ]
	 }
  }
 }
}

With this, suppose I have an email address like "name.sirname@example.com". The above analyzer also fails for searches like ".sirname" or "@example".
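To see where the tokens break, the _analyze API can be run against the custom analyzer (index name is a placeholder):

```json
GET /myindex/_analyze
{
  "analyzer": "email",
  "text": "name.sirname@example.com"
}
```

The standard tokenizer splits the address at characters like "@", so the edge_ngram filter only ever emits prefixes of each resulting fragment; an infix term like "@example" can never match.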

Thanks

Hi @ashishtiwari1993,

As you described, ** does work, but gives poor performance. This is due to how Elasticsearch indexes its documents internally.

Another approach is to index your strings multiple times (using multi-fields) with different custom analyzers. In particular, you might want to look into the NGram tokenizer.

You might then also want to consider combining NGrams with a prefix query.
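As a sketch, an ngram tokenizer could be configured like this (names and gram sizes are assumptions; note that spreads larger than 1 between min_gram and max_gram may also require raising the index.max_ngram_diff index setting):

```json
"analysis": {
  "tokenizer": {
    "email_ngram": {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 10,
      "token_chars": ["letter", "digit", "punctuation", "symbol"]
    }
  },
  "analyzer": {
    "email": {
      "type": "custom",
      "tokenizer": "email_ngram",
      "filter": ["lowercase"]
    }
  }
}
```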

@fkelbert Thanks for the quick reply. I have edited my question; I forgot to mention the edge_ngram tokenizer. With edge_ngram I am facing issues with special chars like "." and "@".

Hi @ashishtiwari1993, consider using NGrams instead of Edge-NGrams.

Further to that, you probably want to use the keyword tokenizer, which does not actually tokenize the string (i.e., it is a noop and keeps the string as-is).
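Putting the two suggestions together, a sketch of a working setup might look like the following (index name, field name, and gram sizes are assumptions; using a separate non-ngram search_analyzer keeps the query term intact, so a match query behaves like LIKE '%term%' for substrings up to max_gram characters long):

```json
PUT /myindex
{
  "settings": {
    "index.max_ngram_diff": 20,
    "analysis": {
      "filter": {
        "email_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "email_index": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "email_ngram"]
        },
        "email_search": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "emailaddress": {
          "type": "text",
          "analyzer": "email_index",
          "search_analyzer": "email_search"
        }
      }
    }
  }
}
```

A search for an infix such as "@example" would then be a plain match query:

```json
GET /myindex/_search
{
  "query": {
    "match": {
      "emailaddress": "@example"
    }
  }
}
```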

@fkelbert That's cool, it works :slight_smile: Thanks for the quick help. Just one doubt: I have lots of records containing email addresses, and the ngram tokenizer is now going to be applied to every one of them. Would that cause a delay in write performance? I have multiple fields on which I want to apply the same method.

Great to hear :slight_smile:

I'm afraid you'll need to experiment with performance. You can increase write throughput by configuring your index to use more primary shards.
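For example (the shard count is illustrative; number_of_shards can only be set at index creation time):

```json
PUT /myindex
{
  "settings": {
    "number_of_shards": 5
  }
}
```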

Out of curiosity: how much faster are the ngrams compared to the ** method for you?

Another note: Think twice whether you actually need this kind of method for many of your fields. For search, the "more standard" analyzers do a pretty good job in 95% of the use cases.

Right now I am using the same **. I need to reindex the data to measure performance, but it will definitely be faster than regex, wildcard & **. I will share my results here :slight_smile:. Thanks for the suggestion @fkelbert; I will definitely keep the use case in mind.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.