LIKE query on a string field


(Ashish Tiwari) #1

Hi guys,

Specification:
Elasticsearch version : 6.2.2

I have data with an "emailaddress" field. Suppose one of the values is "user@example.com". The search input might be "@exam", "user@", "user", ".com", etc.

I have tried to achieve this using query_string with a `*term*` wildcard pattern. I got the right results, but it seems slow. I also cannot use the uax_url_email tokenizer, because the user is free to give any input; I have used the default standard tokenizer.

With this, suppose I have two email addresses, "user@example.com" and "example@user.com". If I search for "user@", both results are returned, which is not what I want.

Is there any specific tokenizer or method by which I can achieve this? It should behave like MySQL's LIKE query with %@user%.
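For reference, the LIKE semantics I want are plain substring matching. A quick Python sketch (illustrative only, not Elasticsearch code) of what `%@user%` should match:

```python
from fnmatch import fnmatch

# SQL LIKE '%@user%' corresponds to the glob pattern '*@user*':
# it should match only addresses where "@user" appears as a contiguous substring.
emails = ["user@example.com", "example@user.com"]
matches = [e for e in emails if fnmatch(e, "*@user*")]
print(matches)  # only "example@user.com" contains the substring "@user"
```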

After this I used an edge_ngram token filter, which gives partially better output. Below is my analyzer:

"analysis": {
  "filter": {
    "emailtoken": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 255,
      "token_chars": [
        "letter",
        "digit",
        "symbol",
        "punctuation"
      ]
    }
  },
  "analyzer": {
    "email": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "emailtoken"
      ]
    }
  }
}

With this, suppose I have an email address like "name.sirname@example.com". The above analyzer also fails for searches like ".sirname" or "@example".
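To illustrate the failure, here is a rough Python simulation (the real standard tokenizer follows Unicode word-boundary rules; splitting on "@" is an approximation of its behavior on email addresses):

```python
def edge_ngrams(token, min_gram=1, max_gram=255):
    """All prefixes of `token` between min_gram and max_gram characters,
    mimicking an edge_ngram filter (edge n-grams are prefixes only)."""
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

# Rough stand-in for the standard tokenizer: it drops the "@" but keeps
# dotted runs like "name.sirname" together, so the index never sees "@".
tokens = "name.sirname@example.com".lower().split("@")

indexed = set()
for t in tokens:
    indexed |= edge_ngrams(t)

print(".sirname" in indexed)   # False: edge n-grams are prefixes only
print("@example" in indexed)   # False: "@" was removed at tokenization time
print("name.sir" in indexed)   # True: it is a prefix of "name.sirname"
```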

Thanks


(Florian Kelbert) #2

Hi @ashishtiwari1993,

As you described, `*term*` wildcards do work, but they give poor performance. This is due to how Elasticsearch indexes its documents internally: a leading wildcard cannot use the inverted index efficiently and forces a scan over many terms.

Another approach is to index your strings multiple times (using multi-fields) with different custom analyzers. In particular, you might want to look into the NGram tokenizer.

You might then also want to consider combining NGrams with a prefix query.
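A minimal sketch of why plain n-grams help here (illustrative Python, not the actual Lucene implementation): with n-grams over the whole string, every substring up to max_gram becomes a searchable term, so "user@" only matches the address that really contains it:

```python
def ngrams(text, min_gram=2, max_gram=10):
    """All substrings of length min_gram..max_gram, like an ngram tokenizer."""
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

a = ngrams("user@example.com")
b = ngrams("example@user.com")

print("user@" in a)  # True  - "user@" is a substring of "user@example.com"
print("user@" in b)  # False - "example@user.com" contains "user." but not "user@"
print("@exam" in a)  # True  - special characters survive, unlike with standard
```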


(Ashish Tiwari) #3

@fkelbert Thanks for the quick reply. I have edited my question; I forgot to mention the edge_ngram setup. With edge_ngram I am facing issues with special characters like "." and "@".


(Florian Kelbert) #4

Hi @ashishtiwari1993, consider using NGrams instead of Edge-NGrams.


(Florian Kelbert) #5

Further to that, you probably want to use the keyword tokenizer, which does not actually tokenize the string (i.e., it is a noop and keeps the string as-is).
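Putting the two suggestions together, the settings might look something like this (a sketch only, not a tested mapping; the filter name and gram sizes are placeholders, shown here as a Python dict):

```python
# Hypothetical index settings combining the keyword tokenizer (keeps the
# whole email as one token, "@" and "." included) with an ngram filter.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "email_ngram": {           # assumed filter name
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,        # a large max_gram inflates the index
                }
            },
            "analyzer": {
                "email": {
                    "type": "custom",
                    "tokenizer": "keyword",  # noop: no splitting on "@" or "."
                    "filter": ["lowercase", "email_ngram"],
                }
            },
        }
    }
}
```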


(Ashish Tiwari) #6

@fkelbert That's cool, it works :slight_smile: Thanks for the quick help. Just one doubt: I have lots of records containing email addresses, and the ngram tokenizer will now be applied to each of them. Would that cause a delay in write performance? I have multiple fields on which I want to apply the same method.


(Florian Kelbert) #7

Great to hear :slight_smile:

I'm afraid you'll need to experiment with performance. You can increase write throughput by configuring your index to use more primary shards.
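The primary shard count is fixed at index-creation time, so it would be set when (re)indexing. A sketch of the relevant settings (the values are examples, not recommendations):

```python
# Primary shard count is set at index creation and cannot be changed later
# without reindexing; more primaries can raise indexing throughput.
index_body = {
    "settings": {
        "number_of_shards": 3,    # example value only
        "number_of_replicas": 1,
    }
}
```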

Out of curiosity: how much faster are the ngrams compared to the `*term*` wildcard method for you?


(Florian Kelbert) #8

Another note: think twice about whether you actually need this kind of method for many of your fields. For search, the more standard analyzers do a pretty good job in 95% of the use cases.


(Ashish Tiwari) #9

Right now I am still using the same `*term*` wildcard approach. I need to reindex the data to measure performance, but it will definitely be faster than regex and wildcard queries. I will share my results here :slight_smile: . Thanks for the suggestion @fkelbert. I will definitely take care of the use case.


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.