Hello!
I'm new to ES and I'm trying to figure out how to write a query that asserts the following:
- All given tokens are present in the field - regardless of the order
- No other tokens are present.
Ideally, I'd also be able to confirm that the frequency of each token in the field matches.
e.g. my index contains the following documents:
{"message": "hello world"}
{"message": "hello hello world"}
{"message": "hello hello world byebye"}
And I'd need to be able to generate a query from "hello world hello" which would match the second document only. This solution needs to be scalable.
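To make the requirement concrete, here is a minimal Python sketch of the matching rule I have in mind. It assumes plain whitespace splitting and lowercasing as a stand-in for the real ES analyzer, and `tokenize` and `same_token_multiset` are just illustrative names:

```python
from collections import Counter

def tokenize(text):
    # Assumption: whitespace split + lowercase stands in for the real ES analyzer.
    return text.lower().split()

def same_token_multiset(query, field_value):
    # True only when both texts contain exactly the same tokens
    # with exactly the same frequencies, in any order.
    return Counter(tokenize(query)) == Counter(tokenize(field_value))

docs = ["hello world", "hello hello world", "hello hello world byebye"]
matches = [d for d in docs if same_token_multiset("hello world hello", d)]
print(matches)  # ['hello hello world']
```

This is only the rule itself, checked in memory; the question is how to get the same result from an ES query at scale.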
Solution 1:
Use regexp queries. These are rather slow, so I'd avoid them if possible.
Solution 2:
Pass the field through the analyzer to get the tokenized/clean version, count the occurrences of each unique term, sort the counts by the alphanumerical order of their terms, and add a field which contains this sorted count. Finally, perform a query which asserts:
- That every token is present.
- That the sorted-unique-token-count matches as a keyword.
This new field asserts that there is the same number of unique tokens and that their counts are the same.
e.g.
{"message": "hello world", "ucount":"1/1"}
{"message": "hello hello world", "ucount":"2/1"}
{"message": "hello hello world byebye", "ucount":"1/2/1"}
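A rough sketch of how I picture Solution 2 on the Python side. It assumes the tokens have already been analyzed, that "message" is a text field and "ucount" is mapped as a keyword; `ucount` and `build_query` are hypothetical helper names, not part of any client:

```python
from collections import Counter

def ucount(tokens):
    # Count each unique token, order the counts by their token
    # (alphanumerical order), and join them into one keyword value.
    counts = Counter(tokens)
    return "/".join(str(counts[t]) for t in sorted(counts))

def build_query(tokens):
    # Every unique token must be present in "message",
    # and the "ucount" keyword must match exactly.
    filters = [{"term": {"message": t}} for t in sorted(set(tokens))]
    filters.append({"term": {"ucount": ucount(tokens)}})
    return {"query": {"bool": {"filter": filters}}}

# Assumption: this split stands in for the output of the real analyzer.
tokens = "hello world hello".split()
body = build_query(tokens)
print(body)
```

The resulting body could then be passed to the 7.x Python client, e.g. `es.search(index="my-index", body=body)`, where "my-index" is a placeholder name.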
The problem with this solution is that I can't call the analyzer in bulk to tokenize and clean the fields, and calling it once per entry is not realistic in my case. I could work around this by reimplementing the analyzer in my own code, but that becomes a potential source of bugs if the analyzer ever changes in a later version of ES.
I'm hoping there is a simpler, or at least quicker, way to do this: some kind of "match only" query, or maybe a way to get similar behaviour by combining "must" and "must_not".
I'm using ES 7 and the python client interface.
Many thanks