Double wildcard in query_string query causes unexpected highlighting with the plain and fast vector highlighters

Hi,

After creating the mapping:
PUT http://localhost:9200/index/_mapping/sometype { "properties" : { "sometext" : { "type" : "string", "term_vector" : "with_positions_offsets" } } }

and data:

POST http://localhost:9200/index/sometype { "sometext" : "A supervisor is responsible for the productivity and actions of a small group of employees. The supervisor has several manager-like roles, responsibilities, and powers. Two of the key differences between a supervisor and a manager are (1) the supervisor does not typically have hire and fire authority, and (2) the supervisor does not have budget authority." }

I'm trying to find all documents, but instead of a single wildcard I typed a double one:
POST http://localhost:9200/index/sometype/_search { "query" : { "query_string" : { "query" : "**", "fields" : ["sometext"] } }, "highlight" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"], "order" : "score", "require_field_match" : true, "fields" : { "sometext" : { "fragment_size" : 150, "number_of_fragments" : 1 } } } }

and got the following highlight:

"highlight" : { "sometext" : ["responsibilities, <em>and</em> <em>powers</em>. <em>Two</em> <em>of</em> <em>the</em> <em>key</em> <em>differences</em> <em>between</em> <em>a</em> <em>supervisor</em> <em>and</em> <em>a</em> <em>manager</em> <em>are</em> (<em>1</em>) <em>the</em> <em>supervisor</em> <em>does</em> <em>not</em> <em>typically</em> <em>have</em> <em>hire</em> <em>and</em> <em>fire</em> <em>authority</em>, and"] }

The query *? produces the same highlighting results. But when the query consists of just a single asterisk, the highlighter returns nothing.

With the plain highlighter (I just added "type" : "plain" to the highlight section), the result looks a bit different (but still weird):

"highlight" : { "sometext" : [", <em>responsibilities</em>, <em>and</em> <em>powers</em>. <em>Two</em> <em>of</em> <em>the</em> <em>key</em> <em>differences</em> <em>between</em> <em>a</em> <em>supervisor</em> <em>and</em> <em>a</em> <em>manager</em> <em>are</em> (<em>1</em>) <em>the</em> <em>supervisor</em> <em>does</em> <em>not</em> <em>typically</em> <em>have</em> <em>hire</em> <em>and</em> <em>fire</em> <em>authority</em>, <em>and</em> (<em>2</em>) <em>the</em> <em>supervisor</em> <em>does</em> <em>not</em> <em>have</em> <em>budget</em> <em>authority</em>."] }

Does anybody know the reason for this behavior? Maybe queries like ** and *? have some special meaning? Thanks a lot.

P.S.: I've asked this question on Stack Overflow, but nobody has posted an answer.

Comparing the output of the following two validate query API calls shows how the two queries are interpreted:

GET /_validate/query?rewrite=true
{
  "query" : {
    "query_string": {
      "query": "*",
      "fields": [
        "company"
      ]
    }
  }
}

* is a special case in the query_string query, and internally the query is rewritten to ConstantScore(_field_names:company), which means "match all documents containing the field 'company' in a filter context and give them a score of 1".

GET /_validate/query?rewrite=true
{
  "query" : {
    "query_string": {
      "query": "**",
      "fields": [
        "company"
      ]
    }
  }
}

** here is interpreted like any other term, so the query is rewritten to "company:**", which means "match any documents with a company field containing a term matching **". Internally this is presumably converted to a wildcard match (I haven't dug that deep).
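
To illustrate the wildcard reading, here is a minimal Python sketch. It relies on the standard library's fnmatch.fnmatchcase, whose * (any sequence of characters) and ? (any single character) are assumed to behave like the corresponding wildcard characters for these simple patterns. It does not call Elasticsearch or Lucene, and the term list is a hand-picked sample from the indexed text.

```python
from fnmatch import fnmatchcase

# Hand-picked sample of terms from the indexed text (illustrative only).
terms = ["a", "supervisor", "is", "responsible", "1", "2", "authority"]


def matching_terms(pattern, terms):
    """Return the terms that match a glob-style wildcard pattern."""
    return [t for t in terms if fnmatchcase(t, pattern)]


# Both "**" and "*?" match every non-empty term, so every term qualifies.
all_star_star = matching_terms("**", terms)
all_star_q = matching_terms("*?", terms)

# A narrower pattern, for contrast, matches only some terms.
only_s = matching_terms("s*", terms)

print(all_star_star == terms, all_star_q == terms, only_s)
```

Under these assumptions, ** and *? select every term in the field, which is consistent with every word ending up highlighted.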

The reason the second is highlighted and the first is not is that the second rewritten query is still related to matching terms (in fact any term) in the company field, whereas the first query (once rewritten) only relates to the _field_names field (a special meta field which lists the names of the fields contained in the document), so it does not match the fields the highlighter is highlighting. The first query is also rewritten to a filter context, so the highlighter will ignore it, as the highlighter is designed to highlight based on query context only.
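
The difference can be sketched with a toy highlighter in Python. This is not how the Elasticsearch highlighters are implemented: the function highlight, its parameters, and the \w+ tokenization are all hypothetical simplifications, and fnmatchcase stands in for wildcard term matching. It uses the sometext field from the original question.

```python
import re
from fnmatch import fnmatchcase


def highlight(text, field, query_field, pattern):
    """Wrap every token of `text` that matches `pattern` in <em> tags.

    Returns None when the query targets a different field than the one
    being highlighted, i.e. there are no terms to highlight in `field`.
    """
    if query_field != field:
        return None

    def wrap(match):
        token = match.group(0)
        if fnmatchcase(token.lower(), pattern):
            return f"<em>{token}</em>"
        return token

    # Assumption: tokens are runs of word characters.
    return re.sub(r"\w+", wrap, text)


text = "(1) the supervisor does not have budget authority."

# Query rewritten to sometext:** still targets terms in the highlighted field.
wildcard_result = highlight(text, "sometext", "sometext", "**")

# Query rewritten to _field_names:sometext targets a different (meta) field.
field_names_result = highlight(text, "sometext", "_field_names", "sometext")

print(wildcard_result)
print(field_names_result)
```

In this sketch the wildcard query wraps every token, while the query on _field_names yields no highlight at all, mirroring the two behaviors reported above.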

Hope that helps

Thanks a lot, colings86!
I understand now why this happens from the point of view of Elasticsearch internals.
From a user's point of view it looks a bit weird. Is this behavior expected, or may I open an improvement ticket in the bug tracker?

Feel free to open an issue for this, but I don't think it will be very high priority, as this issue is fairly harmless and not many users will see its effects.

Thank you again