Elasticsearch 5.0.0 rc1 highlights incorrectly with hunspell

edeak · October 21, 2016, 2:07pm

Hi,

My issue is specific to the Hungarian language, and it is very strange.
Consider the following word: "alma" ("apple" in English).
This is a noun, but it's also a genitive of "alom" (bedding).

Here's the strange behaviour:

In ES 2.x, if I searched for the word "alma" with highlighting, I found the related documents with "alma" highlighted.
In ES 5.0.0 rc1, if I do the same, "alma" is not highlighted. If I search for the word "alom", I get back "alma" highlighted.

Both ES versions find the documents where "alma" or any other inflected form exists, but the highlighting is strange.

Note that I installed hunspell just by copying the Hungarian dictionary files to the proper place (/etc/elasticsearch/hunspell), and here's the configuration for my index:

{
  "settings": {
    "analysis": {
      "filter": {
        "hu_HU": {
          "locale": "hu_HU",
          "type": "hunspell"
        }
      },
      "analyzer": {
        "no_filter": {
          "char_filter": "html_strip",
          "tokenizer": "whitespace"
        },
        "hu": {
          "filter": [
            "lowercase",
            "hu_HU"
          ],
          "char_filter": "html_strip",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "content": { "type": "text", "analyzer": "hu", "term_vector": "with_positions_offsets_payloads" }
      }
    }
  }
}

TEST DOCUMENT:

PUT ES:9200/test_index/test/1 
{
    "content": "Az alma egy nagyon finom dolog."
}

TEST QUERY (results by value of "query": "alma" - no highlight, "almák" - highlight, "alom" - highlight):

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "alma",
            "type": "best_fields",
            "fields": ["content"]
          }
        }
      ],
      "minimum_should_match": 1
    }
  },
  "highlight": {
    "pre_tags": ["<highlight>"],
    "post_tags": ["</highlight>"],
    "fields": {
      "content": {
        "number_of_fragments": 0
      }
    }
  }
}

Mark_Harwood · October 21, 2016, 4:00pm

Just to help narrow things down - does this also break if you use "type":"plain" on the highlight?
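
A minimal sketch of what that could look like, assuming the same highlight section as in the test query above with only "type" added on the content field (the query part of the request is unchanged and omitted here):

```json
{
  "highlight": {
    "pre_tags": ["<highlight>"],
    "post_tags": ["</highlight>"],
    "fields": {
      "content": {
        "type": "plain",
        "number_of_fragments": 0
      }
    }
  }
}
```

Setting "type" per field forces that highlighter for the field regardless of which index structures the mapping enables.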

edeak · October 24, 2016, 8:22am

Thanks, this does the trick, now it's working!
Another thing: as you can see, I indexed the field with the setting "term_vector": "with_positions_offsets_payloads".
Now when I set "type": "postings" in the highlight, I get the following response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "the field [content] should be indexed with positions and offsets in the postings list to be used with postings highlighter"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query_fetch",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_string",
        "node": "bMJ9xy_bRZ-qIDX4n4f-rA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "the field [content] should be indexed with positions and offsets in the postings list to be used with postings highlighter"
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "the field [content] should be indexed with positions and offsets in the postings list to be used with postings highlighter"
    }
  },
  "status": 400
}

Any idea what causes this?

Mark_Harwood · October 24, 2016, 8:39am

The number of available highlighters is an indication of how tricky the problem of highlighting can be.
Each of these was an attempt by one or more people to "fix" the problem of highlighting where previous highlighters had shortcomings.
Some of the highlighters address the problem by creating special data structures at index time to support the highlighting process. These data structure choices are configured in the mapping definition, so turning on these options effectively dictates the choice of Highlighter implementation used by default for that field. Each Highlighter implementation documents the type of data structures it requires in the mapping, e.g. the Postings highlighter [1].
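
As an illustration of the mapping requirement behind the error above, here is a hedged sketch of a mapping for the content field that stores offsets in the postings list, which is what the postings highlighter documentation [1] asks for via "index_options": "offsets". The type name "test" and analyzer "hu" are taken from the earlier configuration; leaving out "term_vector" is an assumption made for brevity, not a requirement stated in this thread.

```json
{
  "mappings": {
    "test": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "hu",
          "index_options": "offsets"
        }
      }
    }
  }
}
```

A change like this will most likely require recreating the index and reindexing the documents, since the offsets have to be written at index time.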

Despite your mapping choices, you can always revert to the "type": "plain" highlighter, as this does not require any special index structures (but as a consequence may not be as fast as other implementations). As you have discovered, though, sometimes faster Highlighter implementations may not work as well as the plain highlighter for certain choices of analyzers.

Sadly, highlighting is a tricky business!

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-request-highlighting.html#postings-highlighter

edeak · October 24, 2016, 9:13am

Mark, thanks for your help!