I have to write a query that finds documents where one of the fields (type: text, with keyword also available) matches a list of wildcard domains. I know it's sub-optimal, but that's what I have to work with. Since I can't filter on exact terms, I don't see any option other than writing a long query string with wildcards. Something like this:
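A rough sketch of what I mean — `domain` and `my-index` are just placeholder names here:

```json
GET my-index/_search
{
  "query": {
    "query_string": {
      "default_field": "domain",
      "query": "*.example.com OR *.example.org OR *.example.net"
    }
  }
}
```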
Searching for the ends of strings is expensive because it means we can't efficiently accelerate lookups using the index (which stores terms alphabetically, sorted from the start of the string), so we end up scanning all index entries.
A way to counter that is to store a reversed version of the strings. It's a little clunky, but here's an example mapping/doc/query that should be more efficient:
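Something along these lines — the index, field, and analyzer names are just placeholders:

```json
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reversed_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mydomain": {
        "type": "binary",
        "fields": {
          "reversed": {
            "type": "text",
            "analyzer": "reversed_analyzer"
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "mydomain": "www.elastic.co"
}

GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "mydomain.reversed": ".elastic.co"
    }
  }
}
```

Because the `reversed` subfield reverses both the indexed value and the query text, a suffix search becomes a cheap prefix match against the term dictionary.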
For multiple domains you would use a bool query with multiple match_phrase_prefix clauses inside the should clause.
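For example, with the same placeholder names as above:

```json
GET test/_search
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase_prefix": { "mydomain.reversed": ".elastic.co" } },
        { "match_phrase_prefix": { "mydomain.reversed": ".example.com" } },
        { "match_phrase_prefix": { "mydomain.reversed": ".example.org" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```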
Another approach is to make use of the new wildcard field but, like most things, this has trade-offs. These were discussed here, where the original question related to your exact same problem (searching for the ends of domain names).
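For reference, a minimal sketch of that option (again with placeholder names). The wildcard field type was introduced in 7.9 and is designed to handle leading wildcards much more gracefully than text/keyword:

```json
PUT test-wildcard
{
  "mappings": {
    "properties": {
      "mydomain": {
        "type": "wildcard"
      }
    }
  }
}

GET test-wildcard/_search
{
  "query": {
    "wildcard": {
      "mydomain": {
        "value": "*.elastic.co"
      }
    }
  }
}
```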
Thank you Mark for this!
That's easily implementable. Is there any limit on clauses? I have a handful of edge cases where I need to match hundreds of domains in the target index.
Edit: Could you please tell me why you used "type": "binary" in the mapping?
> Edit: Could you please tell me why you used "type": "binary" in the mapping?
Oops, not sure how that happened. That was supposed to be keyword, but it's irrelevant for the purposes of the example since we only use the "reversed" subfield. You'd probably want to do any aggregations on the containing keyword field, though.
> That's easily implementable. Is there any limit on clauses?
Yep, 1024 clauses (the default indices.query.bool.max_clause_count). You'd need to break your search into multiple requests.