We need to implement a free-text search: a user enters a string and we return the docs that contain it in any one of several fields.
So I've written a multi_match query of type cross_fields on the required fields.
Now, the name field is fully analyzed with the standard analyzer, company is not_analyzed (only exact matches should return) and email is analyzed with the keyword tokenizer and a lowercase filter (I need an exact, case-insensitive match on email).
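For readers following along, the setup described above might look roughly like the request bodies below, written as Python dicts. This is only a sketch: the analyzer name `lowercase_keyword`, the type name `person` and the older `string` / `not_analyzed` mapping syntax are assumptions, not taken from the original post.

```python
import json

# Assumed index settings and mapping for the three fields described above.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical name: keeps the whole value as one token, lowercased.
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "person": {  # hypothetical type name
            "properties": {
                "name": {"type": "string", "analyzer": "standard"},
                "company": {"type": "string", "index": "not_analyzed"},
                "email": {"type": "string", "analyzer": "lowercase_keyword"},
            }
        }
    },
}

# The cross_fields multi_match search over the same three fields.
search_body = {
    "query": {
        "multi_match": {
            "query": "Amazon James",
            "type": "cross_fields",
            "fields": ["name", "company", "email"],
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(search_body, indent=2))
```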
The weird behaviour shows up when I search for multiple terms and one of them does not exist:
If I search for "Alex Facebook", I expect to get all docs where any of the 3 fields above contains either "Alex" or "Facebook", and if there are no matches for "Facebook", I still expect to get the docs which match "Alex".
But if I search for a value which matches one of the not fully analyzed fields together with another value which does not exist, I get no results.
Example:
Query: "Amazon James"
There is a doc with company = "Amazon", but there isn't any match for "James", so no results are returned.
Can someone explain this behavior and how I can overcome it?
Given that you have specified "not_analyzed" for "company", that field is searched with the full, untokenized query string, so it will only match companies actually named "Amazon James".
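To make that concrete, here is a small Python model of how each field would tokenize the same query string. It is a deliberate simplification: the three analyzer functions are rough stand-ins for the standard analyzer, a not_analyzed field and a keyword-plus-lowercase analyzer, the sample document is made up, and the matching is a plain "any term in common" check rather than anything Elasticsearch actually does for scoring or term blending.

```python
import re

def standard_like(text):
    # Rough stand-in for the standard analyzer: split into words, lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]

def keyword(text):
    # not_analyzed: the whole value is a single, unchanged token.
    return [text]

def lowercase_keyword(text):
    # keyword tokenizer + lowercase filter: one token, lowercased.
    return [text.lower()]

ANALYZERS = {"name": standard_like, "company": keyword, "email": lowercase_keyword}

# Made-up sample document.
doc = {"name": "Jeff Smith", "company": "Amazon", "email": "Jeff@Example.com"}

def field_matches(field, query, document):
    # A field matches if any query token equals any indexed token,
    # with both sides produced by that field's own analyzer.
    analyze = ANALYZERS[field]
    indexed = set(analyze(document[field]))
    return any(term in indexed for term in analyze(query))

def matching_fields(query, document):
    return [f for f in ANALYZERS if field_matches(f, query, document)]

if __name__ == "__main__":
    for q in ("Amazon", "Amazon James", "Jeff Facebook"):
        print(q, "->", matching_fields(q, doc))
```

In this model "Amazon James" reaches the company field as the single token "Amazon James", which is why it cannot match a document whose company is just "Amazon", while "Jeff Facebook" still matches through the word-split name field.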
Yes, in a similar situation, I've seen a bool/should query work OK. But a problem you need to address first is that you will have trouble getting a string like "Amazon James" to match "Amazon" on a "not_analyzed" field. You might get away with specifying a query-time analyser consisting of a standard tokenizer and no token filters, but then no queries would match multi-word company names like "Mercedes Benz". There might be some clever things you could do using shingle token filters in the query analyser to work around this, but it depends on what your requirements really are.
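One possible shape for such a bool/should query is sketched below. Note that it swaps in a different technique from the one suggested above: instead of a shingle token filter in a query-time analyser, it builds the word shingles on the client in Python and sends each one as its own clause. The helper names `word_shingles` and `build_query` and the choice of a maximum shingle size of 2 are illustrative assumptions.

```python
def word_shingles(text, max_size=2):
    # All runs of 1..max_size consecutive words, shorter runs first.
    words = text.split()
    shingles = []
    for size in range(1, max_size + 1):
        for start in range(len(words) - size + 1):
            shingles.append(" ".join(words[start:start + size]))
    return shingles

def build_query(text):
    # name is fully analyzed, so a plain match on the whole input is enough.
    should = [{"match": {"name": text}}]
    for shingle in word_shingles(text):
        # company is not_analyzed: a term query compares the exact value.
        should.append({"term": {"company": shingle}})
        # match lets the field's own analyzer (keyword + lowercase) handle case.
        should.append({"match": {"email": shingle}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}

if __name__ == "__main__":
    import json
    print(json.dumps(build_query("Amazon James"), indent=2))
```

With this, "Amazon James" produces company clauses for "Amazon", "James" and "Amazon James", so a document whose company is "Amazon" can match, and "Mercedes Benz" is still tried as a whole value. The number of clauses grows with the length of the input, so whether this is acceptable depends on your requirements.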
One last thing: I did some testing, and it seems that using query_string instead of multi_match works for the search "Amazon James" on a non-analyzed field where only Amazon matches.
Why is there a difference in behavior between query_string and multi_match?
From what I understand, query_string by default splits the entire query into multiple terms on spaces and applies the OR operator between them. So "Amazon James" is analyzed separately as "Amazon" and "James"; a company such as Mercedes Benz should be passed to the query quoted, as "Mercedes Benz", so that it is treated as one whole term.
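A sketch of what that query_string request could look like, again as Python dicts. The helper names are hypothetical, and the whitespace-splitting behaviour described above is taken from the post rather than verified here; it may differ between Elasticsearch versions, so it is worth checking against the version you run.

```python
def quote_phrase(phrase):
    # Wrap a multi-word value in double quotes so query_string treats it
    # as one phrase; escape backslashes and embedded quotes first.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def query_string_body(user_input, fields=("name", "company", "email")):
    # default_operator OR is spelled out to make the intent explicit.
    return {
        "query": {
            "query_string": {
                "query": user_input,
                "fields": list(fields),
                "default_operator": "OR",
            }
        }
    }

# Unquoted input: per the post, handled as the separate terms Amazon OR James.
loose = query_string_body("Amazon James")

# Quoted input: intended to reach the company field as one whole value.
exact = query_string_body(quote_phrase("Mercedes Benz"))
```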
Are you specifying specific fields in the query, or are you leaving it as the default "_all" field? I ask this because I suspect that everything in the "_all" field is analysed using the "standard" analyser, and so "Benz Mercedes" would also match "Mercedes Benz".