Missing expected document from terms stats facet

Hello all,
I am experiencing some unexpected results when using a terms stats facet.

A little setup:

Our documents have nested documents called "entities". The entities have a
keyword-analyzed field called "combined" that uniquely identifies them
within a parent document. They also have a field called "frequency" that
specifies how many times that entity occurs within that particular parent
document.

We need to answer the questions:

  1. Given an entity E, what are the top 10 other entities (identified by
    entities.combined) that co-occur in a document with E?
  2. For each of those entities, what is the sum of the frequencies of the
    occurrences (across the parent documents)?
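
To make the intended semantics concrete, here is a minimal in-memory sketch in Python of what we expect the facet to compute. It is not Elasticsearch code: the function name top_cooccurring, the sample documents, and the "Acme:ORG" entity are hypothetical, and it assumes each "combined" value appears at most once per parent document. Unlike the facet, it drops E itself from the output.

```python
from collections import defaultdict

def top_cooccurring(docs, entity, size=10):
    """Return up to `size` (term, count, total) tuples for entities that
    co-occur with `entity`, ordered by summed frequency, highest first."""
    stats = defaultdict(lambda: [0, 0])  # term -> [document count, frequency total]
    for doc in docs:
        entities = doc["entities"]
        # Only parent documents that contain the target entity are of interest.
        if not any(e["combined"] == entity for e in entities):
            continue
        for e in entities:
            if e["combined"] == entity:
                continue  # skip E itself, we only want the *other* entities
            stats[e["combined"]][0] += 1
            stats[e["combined"]][1] += e["frequency"]
    # Sort by total descending, then by term to make ties deterministic.
    ranked = sorted(stats.items(), key=lambda kv: (-kv[1][1], kv[0]))
    return [(term, count, total) for term, (count, total) in ranked[:size]]

# Hypothetical sample data, not taken from our index.
docs = [
    {"entities": [{"combined": "Rapp:PERSON", "frequency": 2},
                  {"combined": "Kevin:PERSON", "frequency": 5}]},
    {"entities": [{"combined": "Rapp:PERSON", "frequency": 1},
                  {"combined": "Kevin:PERSON", "frequency": 1},
                  {"combined": "Acme:ORG", "frequency": 3}]},
    {"entities": [{"combined": "Kevin:PERSON", "frequency": 7}]},
]
result = top_cooccurring(docs, "Rapp:PERSON")
```

The third sample document does not contain Rapp:PERSON, so its Kevin:PERSON frequency of 7 must not contribute to the totals.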

We implemented this using a search with E as the criterion to find the
documents of interest and a terms stats facet to find the other entities:

{"query":
  {"nested":
    {"path": "entities",
     "query": {"term": {"entities.combined": "Rapp:PERSON"}}}},
 "facets":
  {"top_entities":
    {"terms_stats":
      {"key_field": "combined",
       "value_field": "frequency",
       "order": "total",
       "size": 10},
     "nested": "entities"}}}

One of the entities returned by the top_entities facet is:

"term" : "Kevin:PERSON",
"count" : 3,
"total_count" : 3,
"min" : 1.0,
"max" : 5.0,
"total" : 10.0,
"mean" : 3.3333333333333335

We had reason to believe that the count of 3 was wrong (we already knew
that Rapp:PERSON and Kevin:PERSON occurred in 4 documents together).

We verified this by running a search for documents containing both of those
entities:

"fields": ["_id"],
{"path": "entities",
"query": {"term": {"entities.combined": "Rapp:PERSON"}}}},
{"path": "entities",
"query": {"term": {"entities.combined": "Kevin:PERSON"}}}}

Which gave these results:

"hits" : {
"total" : 4,
"max_score" : 11.749819,
"hits" : [ {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186137998",
"_score" : 11.749819
}, {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186138012",
"_score" : 11.748099
}, {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186138055",
"_score" : 11.748099
}, {
"_index" : "documents",
"_type" : "document",
"_id" : "17592186138026",
"_score" : 11.74794
} ]

If we add the facet from above, the example Kevin:PERSON term shows up with
what we believe to be the correct values:

"term" : "Kevin:PERSON",
"count" : 4,
"total_count" : 4,
"min" : 1.0,
"max" : 5.0,
"total" : 14.0,
"mean" : 3.5

The facet in the second search appears to count one more document than the
facet in the first search, yet we checked and all 4 of the hits from the
second query are also included in the hits from the first query.
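
As a quick sanity check on the two facet entries for Kevin:PERSON, the difference between them can be worked out directly. The sketch below is plain Python arithmetic on the numbers quoted above; the variable names are ours, and it assumes Kevin:PERSON occurs at most once per parent document.

```python
import math

# Facet entry from the original search (Rapp:PERSON only).
first = {"count": 3, "min": 1.0, "max": 5.0, "total": 10.0, "mean": 3.3333333333333335}
# Facet entry from the search requiring both entities.
second = {"count": 4, "min": 1.0, "max": 5.0, "total": 14.0, "mean": 3.5}

# Each entry is internally consistent: mean == total / count.
first_consistent = math.isclose(first["total"] / first["count"], first["mean"])
second_consistent = math.isclose(second["total"] / second["count"], second["mean"])

# The gap between the two entries.
missing_docs = second["count"] - first["count"]        # number of nested docs not counted
missing_frequency = second["total"] - first["total"]   # their summed frequency

# A single missing entity with this frequency would leave min and max unchanged.
fits_range = first["min"] <= missing_frequency <= first["max"]
```

Under that assumption, the first facet is missing exactly one Kevin:PERSON nested document, and that document has a frequency of 4.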

Out of curiosity we added a facet_filter to the original query (restricting
to just the Kevin:PERSON term):

{"query":
  {"nested":
    {"path": "entities",
     "query": {"term": {"entities.combined": "Rapp:PERSON"}}}},
 "facets":
  {"top_entities":
    {"facet_filter":
      {"term": {"combined": "Kevin:PERSON"}},
     "terms_stats":
      {"key_field": "combined",
       "value_field": "frequency",
       "order": "total",
       "size": 10},
     "nested": "entities"}}}

and surprisingly (to us at least), it produced the correct facet value:

"term" : "Kevin:PERSON",
"count" : 4,
"total_count" : 4,
"min" : 1.0,
"max" : 5.0,
"total" : 14.0,
"mean" : 3.5

We are at a loss as to why our actual query seems to be missing a document
in the facet calculation.

If anyone could shed some light on this, it would be greatly appreciated.

