I have data with an "emailaddress" field. Suppose one of the values is "user@example.com". The search input might be "@exam", "user@", "user", ".com", etc.
I tried to achieve this using query_string with a ** wildcard pattern match. I get the correct results, but it seems slow. I also cannot use the uax_url_email tokenizer, because the user is free to enter any input, so I used the default standard tokenizer.
With this setup, suppose I have two email addresses, "user@example.com" and "example@user.com". If I search for "user@", both are returned, which is not the output I want.
Is there a specific tokenizer or method with which I can achieve this? It should behave like a MySQL LIKE query with %@user%.
After this I tried the edge_ngram tokenizer, which gives partially better output. Below is my analyzer:
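As you described, ** does work, but it performs poorly. This is due to how Elasticsearch indexes its documents internally: terms are stored in an inverted index, and a pattern with a leading wildcard cannot jump straight to the matching terms, so a large part of the term dictionary has to be scanned.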
Another approach is to index your strings multiple times (using fields) using different custom analyzers. In particular, you might want to look into the NGram tokenizer.
You might then also want to consider combining NGrams with a prefix query.
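To illustrate the idea behind n-grams, here is a minimal pure-Python sketch (not Elasticsearch code). It treats each address as one unsplit string, stores every substring of it in a toy inverted index, and turns a substring search into an exact lookup. The names char_ngrams, index and search, and the gram sizes 1 to 20, are assumptions made for this example only.

```python
def char_ngrams(text, min_gram, max_gram):
    """Return the set of all substrings of text with length min_gram..max_gram."""
    grams = set()
    for n in range(min_gram, max_gram + 1):
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams


# Toy inverted index: n-gram -> set of email addresses that contain it.
emails = ["user@example.com", "example@user.com"]
index = {}
for email in emails:
    for gram in char_ngrams(email.lower(), 1, 20):
        index.setdefault(gram, set()).add(email)


def search(term):
    """Exact lookup of the whole search term; no wildcard scan is needed."""
    return sorted(index.get(term.lower(), set()))


print(search("user@"))   # only user@example.com contains "user@"
print(search(".com"))    # both addresses contain ".com"
```

Because "@" and "." are kept inside the grams, "user@" matches only "user@example.com", which is the LIKE-style behaviour asked for. The price is a much larger index, since every address contributes many terms.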
@fkelbert Thanks for the quick reply. I have edited my question; I forgot to mention the edge_ngram tokenizer. With edge_ngram I am facing an issue with special characters like "." and "@".
Further to that, you probably want to use the keyword tokenizer, which does not actually tokenize the string (i.e., it is a noop and keeps the string as-is).
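A possible index definition along these lines is sketched below as a Python dict that could be sent as the body of an index-creation request. It assumes a recent Elasticsearch version (7.x or later, where mappings have no type name and index.max_ngram_diff must cover the gap between min_gram and max_gram). The analyzer, filter and sub-field names (email_substring, email_whole, substring_ngrams, substring) and the gram sizes 3 to 10 are hypothetical choices, not taken from the thread.

```python
import json

index_body = {
    "settings": {
        # Assumption: must be at least max_gram - min_gram (10 - 3 = 7).
        "index": {"max_ngram_diff": 7},
        "analysis": {
            "filter": {
                # Token filter that emits every 3..10 character substring.
                "substring_ngrams": {"type": "ngram", "min_gram": 3, "max_gram": 10}
            },
            "analyzer": {
                # Index time: keep the address whole, then expand it into n-grams.
                "email_substring": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "substring_ngrams"],
                },
                # Search time: keep the user's input whole, no n-grams.
                "email_whole": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "emailaddress": {
                "type": "text",
                "fields": {
                    "substring": {
                        "type": "text",
                        "analyzer": "email_substring",
                        "search_analyzer": "email_whole",
                    }
                },
            }
        }
    },
}

# Example search body: the input "user@" is looked up as a single term.
query = {"query": {"match": {"emailaddress.substring": "user@"}}}

print(json.dumps(index_body, indent=2))
```

With these sizes, search inputs shorter than 3 or longer than 10 characters have no matching term in the index, so min_gram and max_gram would need to be adjusted to the inputs you expect.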
@fkelbert That's cool, it works. Thanks for the quick help. Just one doubt: I have lots of records that contain email addresses, and the ngram tokenizer will now be applied to each of them. Would that cause a delay in write performance? I ask because I have multiple fields on which I want to use the same method.
Another note: Think twice whether you actually need this kind of method for many of your fields. For search, the "more standard" analyzers do a pretty good job in 95% of the use cases.
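To get a rough feel for the indexing cost, the sketch below counts how many n-gram terms a single value produces. It is only arithmetic on string lengths, not a measurement of Elasticsearch write throughput; the function name ngram_count and the gram sizes are assumptions for illustration.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of n-grams (counted by position) for a string of the given length."""
    return sum(
        length - n + 1
        for n in range(min_gram, max_gram + 1)
        if n <= length
    )


email = "user@example.com"  # 16 characters

# One field value becomes this many terms instead of one or two.
print(ngram_count(len(email), 3, 10))   # 84
print(ngram_count(len(email), 1, 16))   # 136
```

The term count grows with both the value length and the width of the min_gram to max_gram range, so applying this to many fields multiplies index size and indexing work accordingly.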
Right now I am still using **. I need to reindex the data to measure performance, but it will definitely be faster than regex, wildcard and **. I will share my results here. Thanks for the suggestion, @fkelbert. I will definitely keep the use case in mind.