Phrase matching over multiple fields (not a multi_match case)


(Josef Toman) #1

I need a field to be analyzed in two ways:

  1. A normal analyzer (stop words, stemming) - the usual.
  2. Just the raw tokens (no stop words, no stemming).

I need the raw tokens to search for and distinguish some special edge cases. Examples:

  • C&A - A name of a clothing brand that would not survive a normal analyzer.

  • KB - Everyone knows that "kb" is a shorthand for kilobyte. But KB is also an acronym for a major Czech bank. I need to be able to tell these two apart.

So I have the text indexed in two fields, let's say content and content.verbatim. I'd like to know whether it's possible to do phrase matching across these fields so that one part of the query is evaluated against the first field and another part against the second. I don't care about scores and relevance; I need the phrase matching to act as a precise filter. Again, an example:

  • A karta
    There is a credit card called "A karta". I need the content.verbatim field to catch the "A" and the content field to match the inflected forms of "karta". As a result I should be able to find documents containing "A karta", "A kartou", "A Karta", but not "a karta" or "A moje karta".
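
A minimal sketch of how the two analyses could be declared as one multi-field. The thread does not show the actual mapping, so the analyzer choices here are assumptions: the built-in "czech" analyzer stands in for the "normal" analysis and the built-in "whitespace" analyzer for the raw tokens.

```python
import json

# Hypothetical mapping: `content` gets stop words + stemming, while the
# `content.verbatim` sub-field keeps the raw, case-sensitive tokens.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "czech",  # assumption: built-in Czech analyzer
                "fields": {
                    "verbatim": {
                        "type": "text",
                        "analyzer": "whitespace",  # assumption: split on whitespace only
                    }
                },
            }
        }
    }
}

# Body that would be sent when creating the index.
print(json.dumps(mapping, indent=2))
```

Splitting on whitespace alone keeps "C&A" and "KB" intact, but it also leaves punctuation attached to words (for example "kartu."), so a real setup would probably need a custom analyzer for the verbatim field.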

I was not able to figure out how to do it myself. A well-reasoned explanation of why it is not possible would also be much appreciated.


(Alexander Reelsen) #2

Hey,

there are tons of different requirements in these few lines of explanation, with wildly varying complexity, ranging from not splitting terms to context-sensitive entity extraction (a pre-indexing topic).

The Dealing with Human Language chapter in the definitive guide might be a good start.

--Alex


(Josef Toman) #3

Have been through it all at least twice already :)

The examples are meant only as an illustration, to show that I'm not looking for a multi_match query or some other trivial solution.

I'm about to build the next generation of a system where users enter search expressions with standard logical operators and some more of our own design. The expressions go through a parser and in the end I get a syntax tree, which I'd like to translate to ES (instead of the current in-house solution). I control neither the data nor the search queries, so I need a general solution.

Some sort of phrase matching that would allow me to switch contexts (different analyses of the same input text/field) within a single query would do the job. If I understand the output of the analysis correctly, the data (token positions) should allow it. But I didn't find a way to use it.

Here's what I'd like to do:

{
    "match" : {
        "content" : {
            "type" : "phrase_on_steroids",
            "query" : "nová A[content.verbatim] karta"
        }
    }
}

If it's not possible I will find another way. But this sort of phrase matching would be the best.
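
The bracket annotation in the hypothetical query above can be split into per-word field assignments before any translation to ES. The sketch below assumes the syntax is exactly `word[field]` with no spaces inside the brackets; `parse_phrase` is an illustrative name, not part of any library.

```python
import re

# Hypothetical syntax from the post: a word may carry a "[field]" suffix that
# overrides the default field for that word only.
TOKEN = re.compile(r"(?P<word>[^\s\[\]]+)(?:\[(?P<field>[^\[\]\s]+)\])?")

def parse_phrase(query, default_field):
    """Split an annotated phrase into (word, field, position) triples."""
    parts = []
    for position, chunk in enumerate(query.split()):
        m = TOKEN.fullmatch(chunk)
        if m is None:
            raise ValueError(f"cannot parse {chunk!r}")
        parts.append((m.group("word"), m.group("field") or default_field, position))
    return parts

parsed = parse_phrase("nová A[content.verbatim] karta", "content")
print(parsed)
# [('nová', 'content', 0), ('A', 'content.verbatim', 1), ('karta', 'content', 2)]
```

Each word would still have to be run through the analyzer of its assigned field before it can be compared with indexed tokens.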


(Alexander Reelsen) #4

Different analyses of the same input text/field would be a classic multi-field, I suppose, where you query more than one of these fields at query time? Am I missing something?


(Josef Toman) #5

The most_fields and phrase types of the multi_match query might look promising at first glance, but they are not what I'm looking for.

most_fields:

By combining scores from all three fields we can match as many documents as possible with the main field, but use the second and third fields to push the most similar results to the top of the list.

phrase:

The phrase and phrase_prefix types behave just like best_fields, but they use a match_phrase or match_phrase_prefix query instead of a match query.

and

The best_fields type generates a match query for each field and wraps them in a dis_max query, to find the single best matching field.


I need a precise filter, not the most relevant results followed by a long tail. Both multi_match types would give me a lot of false positives.

Let's use the same example. The search query = "nová A karta" [a new A card]

The most_fields strategy will match (among others) "nová karta" [a new card], "karta je nová" [the card is new] and even "Karty jsou rozdány, nová hra začíná." [The cards have been dealt, a new game begins.] or a document with a single character "A".

The phrase strategy is much better but still not acceptable. It will match (among others) any string where there is another stop word in place of the "A", e.g. "nové B karty" [new B cards], "*novou pod kartou" [*(with) a new under card]

Asterisk (*) is used in linguistics to indicate an ungrammatical statement.
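
The false positives of the phrase strategy can be reproduced with a toy model of a stop-word-removing analyzer. Everything below is an illustrative assumption (the stop-word list, the lemma table and the matching function); it only mimics the fact that a removed stop word leaves a position gap that any other removed word fills equally well.

```python
# Toy stand-in for the `content` analyzer: lowercases, drops stop words and maps
# inflected forms to a lemma. The word lists are illustrative assumptions.
STOP_WORDS = {"a", "b", "pod", "o"}
LEMMAS = {"nová": "nový", "nové": "nový", "novou": "nový",
          "karta": "karta", "karty": "karta", "kartou": "karta", "kartu": "karta"}

def analyze(text):
    """Return (token, position) pairs; removed stop words still consume a position."""
    out = []
    for pos, word in enumerate(text.split(), start=1):
        w = word.lower()
        if w in STOP_WORDS:
            continue
        out.append((LEMMAS.get(w, w), pos))
    return out

def phrase_match(query, document):
    """True if the query tokens occur in the document with the same relative offsets."""
    q, d = analyze(query), set(analyze(document))
    if not q:
        return False
    first_token, first_pos = q[0]
    starts = [pos for tok, pos in d if tok == first_token]
    return any(all((tok, start + pos - first_pos) in d for tok, pos in q)
               for start in starts)

print(phrase_match("nová A karta", "nové B karty"))      # True (false positive)
print(phrase_match("nová A karta", "novou pod kartou"))  # True (false positive)
print(phrase_match("nová A karta", "karta je nová"))     # False
```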

When I use the hypothetical phrase_on_steroids match query type with the query "nová A[content.verbatim] karta", I need it to analyze the words "nová" and "karta" with the same analyzer as the content field and match them against the content field, so that it matches all inflected forms. The word "A" must be analyzed with a different analyzer (the one used for the content.verbatim field) and matched against the content.verbatim field.

When a document "Včera jsem požádal o novou A kartu." [I applied for a new A card yesterday.] is indexed, the output of the analyzers looks like this:

content: včera (POS = 1), být (POS = 2), žádat (POS = 3), nový (POS = 5), karta (POS = 7)
content.verbatim: Včera (POS = 1), jsem (POS = 2), požádal (POS = 3), o (POS = 4), novou (POS = 5), A (POS = 6), kartu (POS = 7)

Phrase matching with "nová A karta" will match against the first field thanks to the normalized inflection, but it will give me false positives as well because the stop word is missing there. It won't match against the second field because inflected forms are not normalized there. The two need to be combined properly. In theory it is possible, because the positions of the tokens are the same in both fields. The query "nová A[content.verbatim] karta" should match this token sequence:

nový (POS = 5@content), A (POS = 6@content.verbatim), karta (POS = 7@content)
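
The idea can be checked on the token streams listed above with a small position-based matcher. This is only a model of the desired behaviour, not an Elasticsearch feature; `cross_field_phrase` is a hypothetical helper and the query tokens are assumed to be already analyzed for their respective fields.

```python
# Token streams copied from the post: field -> {position: token}.
index = {
    "content": {1: "včera", 2: "být", 3: "žádat", 5: "nový", 7: "karta"},
    "content.verbatim": {1: "Včera", 2: "jsem", 3: "požádal", 4: "o",
                         5: "novou", 6: "A", 7: "kartu"},
}

def cross_field_phrase(index, clauses):
    """clauses: list of (field, analyzed token), matched at consecutive positions."""
    first_field, first_token = clauses[0]
    starts = [p for p, t in index[first_field].items() if t == first_token]
    return any(
        all(index[field].get(start + offset) == token
            for offset, (field, token) in enumerate(clauses))
        for start in starts
    )

query = [("content", "nový"), ("content.verbatim", "A"), ("content", "karta")]
print(cross_field_phrase(index, query))      # True

# A lowercase "a" must not match the verbatim token "A".
lowercase = [("content", "nový"), ("content.verbatim", "a"), ("content", "karta")]
print(cross_field_phrase(index, lowercase))  # False
```

The match works only because both analyzers assign the same position to the same source word, which requires the stop-word filter to keep position gaps.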

Can I achieve this sort of behaviour? Or maybe should I post this somewhere as a feature request? ;)
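
One avenue that may be worth testing, offered as an assumption and not as an answer confirmed in this thread, is a span_near query whose middle clause is wrapped in a span field masking query, so that a span on content.verbatim is treated as if it belonged to content. The sketch below only builds the request body. Span term queries are not analyzed, so the tokens are assumed to have been analyzed beforehand (for example with the _analyze API). The wrapper key is written here as `span_field_masking`; to my knowledge older releases call it `field_masking_span`, so check the documentation for your version. `masked` is a hypothetical helper.

```python
import json

def masked(field, token, as_field):
    """Wrap a span_term on `field` so it reports `as_field` to the enclosing span query."""
    return {"span_field_masking": {"query": {"span_term": {field: token}},
                                   "field": as_field}}

query = {
    "query": {
        "bool": {
            # Filter context: no scoring, the phrase acts as a yes/no filter.
            "filter": {
                "span_near": {
                    "clauses": [
                        {"span_term": {"content": "nový"}},
                        masked("content.verbatim", "A", "content"),
                        {"span_term": {"content": "karta"}},
                    ],
                    "slop": 0,         # tokens must be adjacent
                    "in_order": True,  # and appear in the given order
                }
            }
        }
    }
}

print(json.dumps(query, ensure_ascii=False, indent=2))
```

Whether this behaves as a precise filter on real data depends on both fields producing identical positions for the same source words, which would need to be verified against the actual analyzers.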

