Match only?

Hello!

I'm new to ES and I'm trying to figure out how to make a query which asserts the following:

  • All given tokens are present in the field - regardless of the order
  • No other tokens are present.

Ideally, I'd also be able to confirm that the frequency of each token in the field matches.

For example, my index contains the following documents:

{"message": "hello world"}
{"message": "hello hello world"}
{"message": "hello hello world byebye"}

I need to be able to generate a query from "hello world hello" that matches the second document only. The solution needs to be scalable.
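To pin down the semantics I'm after, here is a minimal reference check in plain Python. It is only a sketch of the desired behaviour, not an Elasticsearch query. The helper name same_token_multiset is made up for illustration, and lowercasing plus whitespace splitting is an assumption standing in for the real analyzer.

```python
from collections import Counter


def same_token_multiset(query_text, field_text):
    """Return True when both texts hold the same tokens with the same counts."""
    # Assumption: lowercasing + whitespace splitting stands in for the analyzer.
    query_counts = Counter(query_text.lower().split())
    field_counts = Counter(field_text.lower().split())
    return query_counts == field_counts


docs = [
    {"message": "hello world"},
    {"message": "hello hello world"},
    {"message": "hello hello world byebye"},
]

# Order is ignored, but token counts and the absence of extra tokens are enforced.
matching = [d for d in docs if same_token_multiset("hello world hello", d["message"])]
print(matching)  # [{'message': 'hello hello world'}]
```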

Solution 1 :
Use regex queries. This is rather slow, so I'd avoid it if possible.

Solution 2:
Pass the field through the analyzer to get the tokenized/clean version, count the occurrences of each unique term, sort the counts by the alphanumerical order of their terms, and add a field which contains these sorted counts. Finally, perform a query which asserts:

  1. That every token is present.
  2. That the sorted-unique-token-count matches as a keyword.

This new field asserts that the number of unique tokens is the same and that their counts are the same.

e.g.

{"message": "hello world", "ucount":"1/1"}
{"message": "hello hello world", "ucount":"2/1"}
{"message": "hello hello world byebye", "ucount":"1/2/1"}
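A rough sketch of how the count field and the query could be built, assuming the tokens have already been produced by the analyzer (here a plain split stands in for it). The helper names ucount and build_query are made up for illustration, and I'm assuming "ucount" is mapped as a keyword field. The term clauses on "message" only work if the tokens are in their analyzed form.

```python
from collections import Counter


def ucount(tokens):
    """Join the per-token counts with '/', ordered by token."""
    counts = Counter(tokens)
    return "/".join(str(counts[token]) for token in sorted(counts))


def build_query(tokens, text_field="message", count_field="ucount"):
    """Build a bool query: every unique token present, count signature equal."""
    # One term clause per unique token (assumes tokens are already analyzed).
    clauses = [{"term": {text_field: token}} for token in sorted(set(tokens))]
    # Exact match on the count signature (assumes a keyword mapping).
    clauses.append({"term": {count_field: ucount(tokens)}})
    return {"query": {"bool": {"filter": clauses}}}


tokens = "hello world hello".split()  # stand-in for analyzer output
print(ucount(tokens))  # 2/1
print(build_query(tokens))
```

With the Python client this body would presumably be passed as es.search(index="my-index", body=build_query(tokens)), where "my-index" is a placeholder name. The clauses go under filter rather than must because no scoring is needed for an exact match.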

The problem with this solution is that I can't call the analyzer in bulk to perform the tokenization/cleaning of the fields, and calling it for every entry is not realistic in my case. I could work around this by reimplementing the analyzer in my code, but that becomes a potential source of bugs if the analyzer ever changes in a subsequent version of ES.

I'm hopeful that there is a simpler way, or at least a quicker one, to do this: some kind of "match only", or maybe a way to get similar behaviour by combining "must" and "must_not".

I'm using ES 7 and the Python client.

Many thanks

You may use the match phrase query. To get token frequencies, you may use the Term vectors API.
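As a rough illustration only, the two request bodies might look like the sketch below. The helper names are made up, and the sample response is hand-written to mirror the documented shape of a term vectors response. It is not real output from a cluster.

```python
def match_phrase_body(field, text):
    """Body for a match_phrase search on one field."""
    return {"query": {"match_phrase": {field: text}}}


def termvectors_body(field):
    """Body asking the Term vectors API for term frequencies of one field only."""
    return {
        "fields": [field],
        "positions": False,
        "offsets": False,
        "payloads": False,
        "term_statistics": False,
        "field_statistics": False,
    }


def term_frequencies(response, field):
    """Extract {term: term_freq} for one field from a term vectors response."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}


# Hand-written sample shaped like a term vectors response (assumption, not real output).
sample = {
    "found": True,
    "term_vectors": {
        "message": {
            "terms": {"hello": {"term_freq": 2}, "world": {"term_freq": 1}}
        }
    },
}
print(term_frequencies(sample, "message"))  # {'hello': 2, 'world': 1}
```

With the Python client, the second body would presumably go to es.termvectors(index=..., id=..., body=termvectors_body("message")). That is one request per document.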

Thanks for the quick answer. I've considered using match_phrase, but it seems it is not agnostic to term order ("hello world hello" does not match "hello hello world").

I've also looked at term vectors, but they seem to be available only at the document level and not at the field level.
