Multi-value multi-field tag search

Hi,

I've just started with ES and have studied the master and reference guides, but I still can't figure out how to solve my particular use case.

I have a set of FAQ-type documents with a number of fields that boil down to:

{
    "question": {
        "raw_text": "question text",
        "tags": ["tag1", "tag2", "multi word tag1"]
    },
    "answer": {
        "raw_text": "answer text",
        "tags": ["tag2", "multi word tag2"]
    }
}

I have set up a mapping for these documents in which each "raw_text" field is of type "text" and each "tags" field is of type "keyword".
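For reference, the mapping described above could be sketched as the following index body. This is only a minimal sketch: it assumes a typeless mapping (Elasticsearch 7 or later), and the helper names `tagged_text_field` and `build_index_body` are made up for illustration.

```python
def tagged_text_field():
    """Mapping for one sub-object (question or answer)."""
    return {
        "properties": {
            "raw_text": {"type": "text"},  # analyzed, used for full-text search
            "tags": {"type": "keyword"},   # indexed as exact, unanalyzed values
        }
    }


def build_index_body():
    """Index body with the same structure for question and answer."""
    return {
        "mappings": {
            "properties": {
                "question": tagged_text_field(),
                "answer": tagged_text_field(),
            }
        }
    }
```

The returned dict would be passed as the body of an index creation request.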

At search time the query will have the same structure as question/answer, i.e. the user inputs a raw query text and an extra pre-processing step attaches tags to it. I would like the results to be fetched based on a combination of these fields, that is:

  • perform a multi-match of query text on question.raw_text and answer.raw_text --> the more (analyzed) words match, the higher the score
  • perform some type of search of query tags on question.tags and answer.tags, where tags can be single-word or multi-word just like in the documents --> the more (unanalyzed) tags match (for now, regardless of whether they match on answer or question), the higher the score
  • combine the scores from raw_text and tag matches in some way (probably the raw text should have a higher weight but I'd like to experiment with that)
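The three requirements above could be combined in a single bool query, roughly as sketched below. This is one possible shape, not a tested solution: the function name `build_combined_query` and the default weights are assumptions chosen for illustration. Each tag is wrapped in a constant_score clause so that every exact tag match adds the same fixed amount to the score, and the multi_match clause carries its own boost so the text/tag balance can be tuned.

```python
TEXT_FIELDS = ["question.raw_text", "answer.raw_text"]
TAG_FIELDS = ["question.tags", "answer.tags"]


def build_combined_query(text, tags, text_weight=2.0, tag_weight=1.0):
    """Build a bool/should query mixing analyzed text and exact tag matches."""
    # One clause per (tag, field) pair: each exact match adds tag_weight.
    tag_clauses = [
        {"constant_score": {"filter": {"term": {field: tag}}, "boost": tag_weight}}
        for tag in tags
        for field in TAG_FIELDS
    ]
    # Analyzed full-text match over both raw_text fields.
    text_clause = {
        "multi_match": {"query": text, "fields": TEXT_FIELDS, "boost": text_weight}
    }
    return {"query": {"bool": {"should": [text_clause] + tag_clauses}}}
```

For example, `build_combined_query("some question text", ["tag1", "multi word tag2"])` produces one multi_match clause plus four tag clauses; changing `text_weight` and `tag_weight` is where the experimenting with relative weights would happen.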

The raw_text query part is clear to me but I'm struggling with the part related to the tags.
I have so far tried the following options:

{'query': {
    'query_string': {
        'query': 'tag1 OR multi word tag2',
        'fields': ['answer.tags', 'question.tags'],
        'type': 'cross_fields'}}}

{'query': {
    'bool': {
        'should': [
            {'match': {'answer.tags': 'tag1'}},
            {'match': {'answer.tags': 'multi word tag2'}},
            {'match': {'question.tags': 'tag1'}},
            {'match': {'question.tags': 'multi word tag2'}}
        ]
    }}}

{'query': {
    'bool': {
        'should': [
            {'terms': {'answer.tags': ['tag1', 'multi word tag2']}},
            {'terms': {'question.tags': ['tag1', 'multi word tag2']}}
        ]
    }}}

{'query': {
    'bool': {
        'should': [
            {'multi_match': {
                'query': 'tag1',
                'fields': ['question.tags', 'answer.tags']}},
            {'multi_match': {
                'query': 'multi word tag2',
                'fields': ['question.tags', 'answer.tags']}}
        ]
    }}}

Running these on actual data (ignoring the raw text for now), the most reasonable results seem to come from the chain of 4 match queries; the next best is multi_match. The terms-based query does not put the right response in the top position, and the query_string query is awful (I realize it's probably the hardest one to learn, so clearly I don't understand it).

My questions are:

  • is it reasonable to index fields like tags as keywords if I don't intend to perform filtering on them, only search?
  • how can I improve the terms-based query to the level of match-based queries? (I have many fields with a number of tags so it seems ugly to me to separate them manually instead of passing them as lists)
  • am I on the wrong track with all this? if so, can somebody point me to the relevant chapter/post/whatever that I should read?
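On the second question, one way to keep passing tags as lists while still getting one scored clause per tag is to expand the lists in client code. The sketch below rests on an assumption worth checking against the reference docs: that a terms query scores a field the same no matter how many of the listed tags matched, whereas separate term clauses inside bool/should each add their own score. The helper names `tag_clauses` and `build_tag_query` are made up for illustration.

```python
def tag_clauses(tags_by_field):
    """Expand {field: [tags]} into one term clause per (field, tag) pair."""
    return [
        {"term": {field: tag}}
        for field, tags in tags_by_field.items()
        for tag in tags
    ]


def build_tag_query(tags_by_field):
    """Wrap the expanded clauses in bool/should so their scores add up."""
    return {"query": {"bool": {"should": tag_clauses(tags_by_field)}}}
```

Calling `build_tag_query({"answer.tags": ["tag1", "multi word tag2"], "question.tags": ["tag1", "multi word tag2"]})` yields four term clauses, which should behave like the chain of 4 match queries on keyword fields without having to write the clauses out by hand.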

Apologies for the long post, I'm new.
