2.0 Question about what's returning within the Highlighter

Hello,
I've been reading here for a while, but this is my first request for help. I hope I have formatted the requests for you properly.

I am using Elastic 2.0, the official release, and the highlighter appears to be returning too much data that I would like to narrow down.

Here is my simplified model, with a nested structure:

curl -XPUT 'http://127.0.0.1:9200/myindex' -d '
{
    "mappings": {
        "mydoctype": {
            "dynamic": "strict",
            "properties": {
                "RootName": {"type": "string"},
                "IndicatorFlag": {"type": "boolean"},
                "_oCommon": {
                    "properties": {
                        "DocumentName": {"type": "string"},
                        "DocumentType": {"type": "string"},
                        "RandomValue": {"type": "string"}
                    }
                }
            }
        }
    }
}'

Now we insert a document we will later search for:

curl -XPUT 'http://127.0.0.1:9200/myindex/mydoctype/01' -d '
{
  "RootName":"Root 01", "IndicatorFlag": "true"
  ,"_oCommon":{"DocumentType":"mydoctype","DocumentName":"yellow","RandomValue":"blue"}
}'

Per the rules of my implementation, I need to be able to restrict results to a particular DocumentType and a particular RandomValue.

To accomplish this, I am using a bool query: I want to match a string of user-entered text (e.g. root*), and I want to apply a filter for a given DocumentType and a given RandomValue. To top things off, I want to highlight what we matched on.

curl -XGET 'http://127.0.0.1:9200/myindex/_search' -d '
{
    "from": 0,
    "size": 10,
    "query": {
         "query": {
            "bool": {
                "must": [{"query": {"query_string": {"query": "root*"}}}]
                ,"filter": [
                    {"terms": {"_oCommon.DocumentType": ["mydoctype"]}},
                    {"term": {"_oCommon.RandomValue": "blue"}}
                ]
            }
        }
    }
    ,"highlight": {
        "fields": {
            "*": { "require_field_match":"false" }
        }
    }
}'

The query works as expected. What is not expected, however, are the results from the highlighter. It includes every value within the filter section: this query returns THREE highlight values to me, the one that matched my query string (root*) and one for each applied filter.

Now, in this simplified model, this is manageable. I can strip the known values I generated the filter with from the returned collection. But if you expand to a full-blown model, as I have done, you get a slew of hit highlight results, the majority of which are bits of data that matched your filter terms and NOT data that matched your query string, although they are intertwined.

What is shown here (I apologize, I had to redact most of the names and values) comes from an in-house test Elastic server with a few hundred thousand documents. You can see that in a simulated production environment, where we apply filters to control access, we get more filter data that we do not want than highlighting information that we do want. Not only are we getting too much, we are getting matches that seemingly do not make sense.

In this example, I filtered for a RandomValue of "other", and you can see I have many "other" matches in a variety of fields. The two items drawn from the left side are the only two that match the query; the rest are field values that match the values supplied to the filter. If I filter for field X to have a value Y, why does the highlighter show me every instance of value Y in every field within the document? I filtered to ensure field X contained value Y, so why give me every occurrence of Y in the same document? This is the behaviour I am seeing, and it seems odd to me.

I realize that in 2.0 filters are about context, but what am I missing here?

  1. How do I get highlight data from within my query but NOT from the filters applied against the query?
  2. Why are text values that match the values used in the filters, but that sit in separate fields, returned in hit highlighting?

If I change require_field_match to true instead of false, I ONLY get the highlights that match my filters, and none that match my query string.

Thanks much for any insight.

Anyone? Can someone please test it out? I gave you a fully repeatable example. Thank you.

Thanks for the bug report. I opened an issue at https://github.com/elastic/elasticsearch/issues/16705

Hello,
the reason why you don't get any of the query matches highlighted when you set require_field_match to true is that you don't specify a field in your query_string. You can specify it either in the query itself or with the default_field parameter. Alternatively, you can use the fields parameter to search against multiple fields, and use patterns there as well.
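
As a minimal, untested sketch of that suggestion, reusing the index and field names from the original post: here query_string is pointed at RootName through the fields parameter, so a highlighter with require_field_match set to true has a concrete field to match. The search_body_with_fields helper is a hypothetical name, used only to keep the request body in one place. This addresses only the missing query highlights; the filter clauses are left exactly as they were.

```shell
# Hypothetical helper: prints the search request body.
# Assumes the "myindex" mapping and field names from the original post.
search_body_with_fields() {
cat <<'EOF'
{
  "query": {
    "bool": {
      "must": [
        {"query_string": {"query": "root*", "fields": ["RootName"]}}
      ],
      "filter": [
        {"terms": {"_oCommon.DocumentType": ["mydoctype"]}},
        {"term": {"_oCommon.RandomValue": "blue"}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "*": {"require_field_match": true}
    }
  }
}
EOF
}

# To send it to the local node used in the original post:
#   search_body_with_fields | curl -XGET 'http://127.0.0.1:9200/myindex/_search' -d @-
```

Using default_field in place of fields should behave similarly when only one field is searched.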

I do not think that the fact that filter matches get highlighted has recently changed due to the query/filter merging. If you don't want filters to be highlighted, I would specify a highlight_query in the highlight section, including the query_string and leaving out the rest.
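
A rough sketch of what that could look like, again assuming the index and field names from the original post and not verified against a particular release: the main query keeps its filters, while highlight_query repeats only the query_string part, so the highlighter is not handed the filter terms. The search_body_with_highlight_query helper is a hypothetical name for illustration.

```shell
# Hypothetical helper: prints a search body whose highlighter uses its own query.
# Assumes the "myindex" mapping and field names from the original post.
search_body_with_highlight_query() {
cat <<'EOF'
{
  "query": {
    "bool": {
      "must": [
        {"query_string": {"query": "root*"}}
      ],
      "filter": [
        {"terms": {"_oCommon.DocumentType": ["mydoctype"]}},
        {"term": {"_oCommon.RandomValue": "blue"}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "require_field_match": false,
        "highlight_query": {
          "query_string": {"query": "root*"}
        }
      }
    }
  }
}
EOF
}

# To send it to the local node used in the original post:
#   search_body_with_highlight_query | curl -XGET 'http://127.0.0.1:9200/myindex/_search' -d @-
```

The filters still decide which documents come back; only the highlighting is driven by the separate query.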

Hope this helps.
Luca

I stand corrected on the query/filter problem: I double-checked, and this has indeed changed as a side effect of the query/filter merging. The same query (using a filtered query) would highlight only the query part in 1.7 and previous versions, while in 2.0+ filters get highlighted as well. Using highlight_query as described is a valid workaround until we get this fixed.

Cheers
Luca

Hello,

Is there any progress on the highlighter issue? I started this thread, which mentions a bug that was opened when I met you folks at Elasticon back in February (https://github.com/elastic/elasticsearch/issues/16705) but that item is closed, and references another highlight issue in item 16709, which is also marked closed, but I don't see the resolution, per se.

I have tested 2.3.3 - even when providing a specific highlighter query, it is still yielding wrong results.

We are expecting to release a product in the fall, and highlighting is a key part of our search results. Can you please let us know when this might be resolved? Truthfully, we cannot go live without it.

thanks.

Hi,
https://github.com/elastic/elasticsearch/issues/16705 is not closed, and it has an assignee. Let me double check what the plan is, and I will try to update the issue. Sorry that this is taking a long time.

Cheers
Luca

Any update? I have just run into this issue today, and was surprised that it has been a known bug for so long. As a novice, it is always painful to run into such bugs, because the initial assumption is that I am doing something wrong...

Thanks!