Elasticsearch: highlighting on both “.keyword” and text fields


(Nikesh) #1

I am running into issues with highlighting when I search a field using its complete value. I use custom analyzers, and each field is indexed as both text and keyword.
I am using a whitespace-based search analyzer.
My custom analyzer is:

    "analysis": {
        "filter": {
            "indexFilter": {
                "type": "pattern_capture",
                "preserve_original": "true",
                "patterns": [
                    "([@,$,%,&,!,.,#,^,*]+)",
                    "([\\w,.]+)",
                    "([\\w,@]+)",
                    "([-]+)",
                    "(\\w+)"
                ]
            }
        },
        "analyzer": {
            "indexAnalyzer": {
                "filter": [
                    "indexFilter",
                    "lowercase"
                ],
                "tokenizer": "whitespace"
            },
            "searchAnalyzer": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "whitespace"
            }
        }
    }

My mapping file is:

          "field": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                },
                "analyzer": "indexAnalyzer",
                "search_analyzer": "searchAnalyzer"
            }

My query is:

{
    "from": 0, "size": 24,
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "monkey business",
                        "type": "phrase",
                        "slop": "2",
                        "fields": []
                    }
                }
            ],
            "minimum_should_match": 1
        }
    },
    "highlight": {
        "type": "unified",
        "fields": {
            "*": {}
        }
    }
}

My highlight results are:

"highlight": {
                    "field.keyword": [
                        "<em>monkey business</em>"
                    ],
                    "field": [
                        "<em>monkey</em> <em>business</em>"
                    ]
                }

(Nikesh) #2

@elastic


(Edoardo) #3

Yes, the problem is the slop parameter in the phrase match. I've noticed the same result. I think the analyzer in this case works as a phrase query plus a match query with some parameter.


(Nikesh) #4

Could you please elaborate? How do I overcome this issue?


(Jaspreet Singh) #5

What you are seeing in highlight.field is exactly the outcome of a "match_phrase" query on "field", which is of type text.
What you are seeing in highlight.field.keyword is exactly the outcome of a "match_phrase" query on "field.keyword", which is of type keyword. (For keyword fields, match and match_phrase behave identically, since keyword values aren't analyzed.)

What really happens in match_phrase, or in a multi_match with type "phrase", is that the query finds (and highlights) all search tokens that occur together. (With slop it is a touch different, but the same concepts apply, so let's keep this simple.)
For the text type, your search string "monkey business" is analyzed into two separate tokens, "monkey" and "business".
Your query then finds all documents that contain both tokens together. That is exactly what you see in the highlights: "monkey" and "business" are highlighted next to each other, but in separate "em" tags, since they are separate tokens.
For the keyword field, there is no analysis phase, so your search string becomes a single search token, "monkey business". Any query (not just phrase) on a keyword field looks for that exact string (including the same case) and highlights the whole value in a single "em" tag.
Hope this helps. This may not be the most technically accurate description, but it explains what may be happening.
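
To make the difference concrete, here is a rough sketch in Python. It is a toy model, not Elasticsearch's highlighter: the function names are made up for illustration, the text analysis is approximated as whitespace splitting plus lowercasing (roughly what the searchAnalyzer does, ignoring the pattern_capture filter), and phrase matching and slop are ignored entirely.

```python
import re


def analyze_text(value):
    """Toy stand-in for a whitespace tokenizer plus lowercase filter.

    Returns a list of (token, start_offset, end_offset) tuples.
    """
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\S+", value)]


def highlight_text(value, query):
    """Wrap every token of a text field that matches a query token in its own <em> tag."""
    query_terms = {token for token, _, _ in analyze_text(query)}
    out, pos = [], 0
    for token, start, end in analyze_text(value):
        out.append(value[pos:start])  # keep the whitespace between tokens
        piece = value[start:end]
        out.append(f"<em>{piece}</em>" if token in query_terms else piece)
        pos = end
    out.append(value[pos:])
    return "".join(out)


def highlight_keyword(value, query):
    """A keyword value is a single un-analyzed term: it matches only as a whole, case included."""
    return f"<em>{value}</em>" if value == query else value


print(highlight_text("monkey business", "monkey business"))
# <em>monkey</em> <em>business</em>
print(highlight_keyword("monkey business", "monkey business"))
# <em>monkey business</em>
```

In this model the text field gets one tag per matching token, while the keyword field is highlighted only as a whole value, which mirrors the two highlight entries in the original post.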


(Nikesh) #6

Thanks for your response.
The same issue persists even when "type": "phrase" is removed.

{
    "from": 0, "size": 24,
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "monkey business",
                        "fields": []
                    }
                }
            ],
            "minimum_should_match": 1
        }
    },
    "highlight": {
        "type": "unified",
        "fields": {
            "*": {}
        }
    }
}