The most_fields and phrase types of the multi match query might look promising at first glance, but they are not what I'm looking for.
most_fields:
By combining scores from all three fields we can match as many documents as possible with the main field, but use the second and third fields to push the most similar results to the top of the list.
phrase:
The phrase and phrase_prefix types behave just like best_fields, but they use a match_phrase or match_phrase_prefix query instead of a match query.
and
The best_fields type generates a match query for each field and wraps them in a dis_max query, to find the single best matching field.
I need a precise filter, not the most relevant results followed by a long tail. Both multi match types would give me a lot of false positives.
Let's use the same example. The search query is "nová A karta" [a new A card].
The most_fields strategy will match (among others) "nová karta" [a new card], "karta je nová" [the card is new] and even "Karty jsou rozdány, nová hra začíná." [The cards have been dealt, a new game begins.] or a document consisting of the single character "A".
The phrase strategy is much better but still not acceptable. It will match (among others) any string in which another stopword stands in place of the "A", e.g. "nové B karty" [new B cards], "*novou pod kartou" [*(with) a new under card]
Asterisk (*) is used in linguistics to indicate an ungrammatical statement.
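For reference, the two strategies discussed above correspond roughly to request bodies like the ones built below. This is only a sketch written as Python dictionaries: the helper name multi_match_body is made up for illustration, and I am assuming the two fields being searched are content and content.verbatim.

```python
import json
from typing import Any, Dict, List


def multi_match_body(query: str, match_type: str, fields: List[str]) -> Dict[str, Any]:
    """Build a multi_match request body (hypothetical helper, for illustration only)."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                "type": match_type,
                "fields": fields,
            }
        }
    }


# Assumption: these are the two fields searched in my mapping.
FIELDS = ["content", "content.verbatim"]

most_fields_body = multi_match_body("nová A karta", "most_fields", FIELDS)
phrase_body = multi_match_body("nová A karta", "phrase", FIELDS)

print(json.dumps(most_fields_body, ensure_ascii=False, indent=2))
print(json.dumps(phrase_body, ensure_ascii=False, indent=2))
```

Both bodies score and rank documents; neither expresses "this word against one field, that word against another, at adjacent positions".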
When I use the hypothetical phrase_on_steroids match query type with the query "nová A[content.verbatim] karta", I need the words "nová" and "karta" to be analyzed with the same analyzer as the content field and matched against the content field, so that all inflected forms match. The word "A" must be analyzed with a different analyzer (the one used for the content.verbatim field) and matched against the content.verbatim field.
When a document "Včera jsem požádal o novou A kartu." [I applied for a new A card yesterday.] is indexed, the output of the analyzers looks like this:
content: včera (POS = 1), být (POS = 2), žádat (POS = 3), nový (POS = 5), karta (POS = 7)
content.verbatim: Včera (POS = 1), jsem (POS = 2), požádal (POS = 3), o (POS = 4), novou (POS = 5), A (POS = 6), kartu (POS = 7)
Phrase matching with "nová A karta" will match against the first field thanks to inflection, but it will also give me false positives because the stopword is missing there. It won't match against the second field, because inflected forms are not matched there. The two fields need to be combined properly. In theory this is possible, because the token positions are the same in both fields. The query "nová A[content.verbatim] karta" should match this token sequence:
nový (POS = 5@content), A (POS = 6@content.verbatim), karta (POS = 7@content)
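To make the desired semantics concrete, here is a minimal Python sketch of the matching I have in mind. It is not Elasticsearch code, only a simulation over token positions. The function name matches_mixed_phrase is made up, and the analyzer output for the second document ("nové B karty") is my assumption of what the two analyzers would produce.

```python
from typing import Dict, List, Tuple

# Per-field analyzer output for one document: field -> {position: token}.
Doc = Dict[str, Dict[int, str]]


def matches_mixed_phrase(doc: Doc, phrase: List[Tuple[str, str]]) -> bool:
    """Return True if the (field, token) pairs occur at consecutive positions,
    each token being looked up in its own field."""
    if not phrase:
        return False
    first_field, first_token = phrase[0]
    # Candidate start positions: wherever the first token occurs in its field.
    starts = [pos for pos, tok in doc.get(first_field, {}).items() if tok == first_token]
    for start in starts:
        if all(
            doc.get(field, {}).get(start + offset) == token
            for offset, (field, token) in enumerate(phrase)
        ):
            return True
    return False


# Analyzer output for the example document, as listed above.
applied = {
    "content": {1: "včera", 2: "být", 3: "žádat", 5: "nový", 7: "karta"},
    "content.verbatim": {
        1: "Včera", 2: "jsem", 3: "požádal", 4: "o", 5: "novou", 6: "A", 7: "kartu",
    },
}

# Assumed analyzer output for the false positive "nové B karty".
b_cards = {
    "content": {1: "nový", 3: "karta"},
    "content.verbatim": {1: "nové", 2: "B", 3: "karty"},
}

# The query "nová A[content.verbatim] karta" after per-field analysis.
query = [("content", "nový"), ("content.verbatim", "A"), ("content", "karta")]

print(matches_mixed_phrase(applied, query))  # True
print(matches_mixed_phrase(b_cards, query))  # False
```

The first document matches because positions 5, 6 and 7 line up across the two fields; the second is rejected because position 2 in content.verbatim holds "B" rather than "A".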
Can I achieve this sort of behaviour? Or should I post this somewhere as a feature request?