Hello,
I've been reading here for a while, but this is my first request for help. I hope I have formatted the requests for you properly.
I am using Elasticsearch 2.0 (the official release), and the highlighter is returning more data than I want. I would like to narrow it down.
Here is my simplified model, with a nested structure:
curl -XPUT 'http://127.0.0.1:9200/myindex' -d '
{
  "mappings": {
    "mydoctype": {
      "dynamic": "strict",
      "properties": {
        "RootName": {"type": "string"},
        "IndicatorFlag": {"type": "boolean"},
        "_oCommon": {
          "properties": {
            "DocumentName": {"type": "string"},
            "DocumentType": {"type": "string"},
            "RandomValue": {"type": "string"}
          }
        }
      }
    }
  }
}'
Now we insert a document we will later search for:
curl -XPUT 'http://127.0.0.1:9200/myindex/mydoctype/01' -d '
{
  "RootName": "Root 01",
  "IndicatorFlag": true,
  "_oCommon": {"DocumentType": "mydoctype", "DocumentName": "yellow", "RandomValue": "blue"}
}'
Per the rules of my implementation, I need to be able to filter out data of a particular DocumentType, and of a particular RandomValue.
To accomplish this, I am using a bool query: the must clause matches a string of user-entered text (e.g. root*), and the filter clause restricts results to a given DocumentType and a given RandomValue. To top things off, I want to highlight what we matched upon.
curl -XGET 'http://127.0.0.1:9200/myindex/_search' -d '
{
  "from": 0,
  "size": 10,
  "query": {
    "query": {
      "bool": {
        "must": [{"query": {"query_string": {"query": "root*"}}}],
        "filter": [
          {"terms": {"_oCommon.DocumentType": ["mydoctype"]}},
          {"term": {"_oCommon.RandomValue": "blue"}}
        ]
      }
    }
  },
  "highlight": {
    "fields": {
      "*": {"require_field_match": "false"}
    }
  }
}'
The query works as expected. What is not expected, however, is the output of the highlighter. It includes every value from the filter section: this query returns THREE highlight values to me, the one that matched my query string (root*) and one for each applied filter.
In this simplified model, that is manageable. I can strip the known values I generated the filter with from the returned collection. But if you expand to a full-blown model, as I have done, you get a slew of highlight results, the majority of which are bits of data that matched a filter term and NOT data that matched the query string section, and the two kinds are intertwined.
As shown here (I apologize, I had to redact most of the names and values), this is an in-house test Elasticsearch server with a few hundred thousand documents. You can see that in a simulated production environment, where we apply filters for controlling access, we get more filter data that we do not want than highlighting information that we do want. Not only are we getting too much, we are getting matches that seemingly do not make sense.
In this example, I filtered for a RandomValue of "other", and you can see I have many "other" matches in a variety of fields. The two items drawn from the left side are the only two that match the query; the rest are field values that match the values supplied to the filter. If I filter for field X to have a value Y, why does the highlighter show me every instance of value Y in every field within the document? I filtered to ensure field X contained value Y, so why give me every occurrence of Y in the same document? This is the behaviour I am seeing and it seems odd to me.
I realize that in 2.0 filters are about context, but what am I missing here?
- How do I get highlight data from within my query but NOT from the filters applied against the query?
- Why are text values that match the text values used in the filters, but sit in separate fields, returned in hit highlighting?
If I change require_field_match to true instead of false, I ONLY get the highlights that match my filters, and none that match my query string.
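One direction that may be worth testing for the first question is the highlighter's highlight_query option, which asks the highlighter to extract terms from a separate query rather than from the whole search query (filters included). The following is only a sketch and has not been verified against this index: it assumes highlight_query behaves as documented on this version, it drops the extra "query" wrappers from the search above on the assumption that they are redundant, and BODY and ES_URL are names made up for the example.

```shell
#!/bin/sh
# Sketch only: same search as above, but the highlighter is pointed at the
# query_string part alone via highlight_query (assumed to be supported here).
# BODY and ES_URL are hypothetical names used for this example.
BODY='{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [{"query_string": {"query": "root*"}}],
      "filter": [
        {"terms": {"_oCommon.DocumentType": ["mydoctype"]}},
        {"term": {"_oCommon.RandomValue": "blue"}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "require_field_match": false,
        "highlight_query": {"query_string": {"query": "root*"}}
      }
    }
  }
}'

# Send the request only when a server URL is supplied,
# e.g. ES_URL=http://127.0.0.1:9200 sh this_script.sh
if [ -n "${ES_URL:-}" ]; then
  curl -XGET "$ES_URL/myindex/_search?pretty" -d "$BODY"
fi
```

The filter clauses still restrict which documents come back; only the set of terms handed to the highlighter changes. Whether this removes the unwanted "blue" and "mydoctype" highlights in practice would need to be checked against the real data.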
Thanks much for any insight.