Hello all,
I am experiencing some unexpected results when using a terms stats facet.
A little setup:
Our documents have nested documents called "entities". The entities have a
keyword-analyzed field called "combined" that uniquely identifies them
within a parent document. They also have a field called "frequency" that
specifies how many times that entity occurs within that particular parent
document.
We need to answer the questions:
- Given an entity E, what are the top 10 other entities (identified by
  entities.combined) that co-occur in a document with E?
- For each of those entities, what is the sum of the frequencies of the
  occurrences (across the parent documents)?
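To make the expected result concrete, here is a minimal Python sketch of the computation we are after, run over a small in-memory list of documents. The sample documents, the "Acme:ORG" entity and the function name top_cooccurring are made up for illustration. This only describes the answer we expect, not how Elasticsearch computes the facet.

```python
from collections import defaultdict


def top_cooccurring(docs, entity, size=10):
    """Rank the other entities that co-occur with `entity`.

    For each other entity, count the parent documents it shares with
    `entity` and sum its frequency across those documents.
    """
    stats = defaultdict(lambda: {"count": 0, "total": 0})
    for doc in docs:
        entities = doc["entities"]
        # Only parent documents that contain `entity` are of interest.
        if not any(e["combined"] == entity for e in entities):
            continue
        for e in entities:
            if e["combined"] == entity:
                continue  # skip E itself, we want the *other* entities
            stats[e["combined"]]["count"] += 1
            stats[e["combined"]]["total"] += e["frequency"]
    # Order by summed frequency, like "order": "total" in the facet.
    ranked = sorted(stats.items(), key=lambda kv: kv[1]["total"], reverse=True)
    return ranked[:size]


# Hypothetical sample data, not taken from our index.
docs = [
    {"entities": [{"combined": "Rapp:PERSON", "frequency": 2},
                  {"combined": "Kevin:PERSON", "frequency": 1}]},
    {"entities": [{"combined": "Rapp:PERSON", "frequency": 1},
                  {"combined": "Kevin:PERSON", "frequency": 5},
                  {"combined": "Acme:ORG", "frequency": 3}]},
    # No Rapp:PERSON here, so this document must not contribute.
    {"entities": [{"combined": "Kevin:PERSON", "frequency": 7}]},
]

for term, s in top_cooccurring(docs, "Rapp:PERSON"):
    print(term, s["count"], s["total"])
```

Because "combined" is unique within a parent document, the per-entity count here equals the number of parent documents in which the two entities co-occur, which is the number we expect the facet's "count" to match.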
We implemented this using a search with E as the criterion to find the
documents of interest and a terms stats facet to find the other entities:
{
  "query":
    {"nested":
      {"path": "entities",
       "query":
         {"term":
           {"entities.combined": "Rapp:PERSON"}}}},
  "facets":
    {"top_entities":
      {"terms_stats":
        {"key_field": "combined",
         "value_field": "frequency",
         "order": "total",
         "size": 10},
       "nested": "entities"}}
}
One of the entities returned by the top_entities facet is:
{
"term" : "Kevin:PERSON",
"count" : 3,
"total_count" : 3,
"min" : 1.0,
"max" : 5.0,
"total" : 10.0,
"mean" : 3.3333333333333335
}
We had reason to believe that the count of 3 was wrong (we already knew
that Rapp:PERSON and Kevin:PERSON occurred in 4 documents together).
We verified this by running a search for documents containing both those
entities:
{
  "fields": ["_id"],
  "query":
    {"bool":
      {"must":
        [{"nested":
           {"path": "entities",
            "query": {"term": {"entities.combined": "Rapp:PERSON"}}}},
         {"nested":
           {"path": "entities",
            "query": {"term": {"entities.combined": "Kevin:PERSON"}}}}
        ]}
    }
}
Which gave these results:
"hits" : {
"total" : 4,
"max_score" : 11.749819,
"hits" : [ {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186137998",
"_score" : 11.749819
}, {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186138012",
"_score" : 11.748099
}, {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186138055",
"_score" : 11.748099
}, {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186138026",
"_score" : 11.74794
} ]
}
If we add the facet from above to this second query, the example
Kevin:PERSON term shows up with what we believe to be the correct values:
{
"term" : "Kevin:PERSON",
"count" : 4,
"total_count" : 4,
"min" : 1.0,
"max" : 5.0,
"total" : 14.0,
"mean" : 3.5
}
The facet in the second search appears to count one more document than the
facet in the first search, yet we checked that all 4 of the hits from the
second query are also among the hits from the first query.
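The two facet results are at least consistent with exactly one document being dropped. Assuming the stats are plain aggregates over the per-document frequency values, and that the 3 counted documents are a subset of the 4, the numbers imply frequencies of 1, 5 and 4 in the counted documents and 4 in the missing one. A quick Python check of that arithmetic (the frequency lists are inferred from the facet output, not read from the index, and the stats helper is just for illustration):

```python
def stats(values):
    """Recompute the terms_stats-style aggregates for a list of frequencies."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "total": total,
        "mean": total / len(values),
    }


# Inferred from the first facet result: count 3, min 1, max 5, total 10.
counted = [1.0, 5.0, 4.0]

# The second facet result has total 14, so the dropped document
# would carry the difference (assumption: same three documents plus one).
missing = 14.0 - sum(counted)
full = counted + [missing]

print(stats(counted))
print(stats(full))
print("frequency in the missing document:", missing)
```

So whichever document is being skipped, it should be one where Kevin:PERSON has a frequency of 4, which may help narrow down which of the 4 hits is affected.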
Out of curiosity we added a facet_filter to the original query (restricting
to just the Kevin:PERSON term):
{
  "query":
    {"nested":
      {"path": "entities",
       "query":
         {"term":
           {"entities.combined": "Rapp:PERSON"}}}},
  "facets":
    {"top_entities":
      {"facet_filter":
        {"term": {"combined": "Kevin:PERSON"}},
       "terms_stats":
        {"key_field": "combined",
         "value_field": "frequency",
         "order": "total",
         "size": 10},
       "nested": "entities"}}
}
and surprisingly (to us at least), it produced the correct facet value:
{
"term" : "Kevin:PERSON",
"count" : 4,
"total_count" : 4,
"min" : 1.0,
"max" : 5.0,
"total" : 14.0,
"mean" : 3.5
}
We are at a loss as to why our actual query seems to be missing a document
in the facet calculation.
If anyone could shed some light on this, it would be greatly appreciated.
Thanks,
Caleb