Missing expected document from terms stats facet

Hello all,
I am experiencing some unexpected results when using a terms stats facet.

A little setup:

Our documents have nested documents called "entities". The entities have a
keyword-analyzed field called "combined" that uniquely identifies them
within a parent document. They also have a field called "frequency" that
specifies how many times that entity occurs within that particular parent
document.

We need to answer the questions:

  1. Given an entity E, what are the top 10 other entities (identified by
    entities.combined) that co-occur in a document with E?
  2. For each of those entities, what is the sum of the frequencies of the
    occurrences (across the parent documents)?
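To make the expected semantics concrete, here is a minimal sketch of the calculation we have in mind, done over plain in-memory documents rather than in Elasticsearch. The function name co_occurring_totals and the sample data are hypothetical and only for illustration; they are not taken from our index.

```python
from collections import defaultdict


def co_occurring_totals(docs, target, size=10):
    """Return up to `size` (entity, count, total) tuples for entities that
    share a parent document with `target`, ordered by total frequency."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for doc in docs:
        entities = doc["entities"]
        # Skip parent documents that do not contain the target entity.
        if not any(e["combined"] == target for e in entities):
            continue
        for e in entities:
            if e["combined"] == target:
                continue  # we only want the *other* entities
            counts[e["combined"]] += 1
            totals[e["combined"]] += e["frequency"]
    # Highest total first; ties broken by entity name for a stable order.
    ranked = sorted(totals, key=lambda k: (-totals[k], k))
    return [(k, counts[k], totals[k]) for k in ranked[:size]]


# Hypothetical sample data, not taken from our index.
docs = [
    {"entities": [{"combined": "A:PERSON", "frequency": 2},
                  {"combined": "B:PERSON", "frequency": 5}]},
    {"entities": [{"combined": "A:PERSON", "frequency": 1},
                  {"combined": "B:PERSON", "frequency": 1},
                  {"combined": "C:PLACE", "frequency": 3}]},
    {"entities": [{"combined": "B:PERSON", "frequency": 9}]},
]
result = co_occurring_totals(docs, "A:PERSON")
```

In this sketch the third document is ignored because it does not contain A:PERSON, which is the behaviour we expect from the facet when it is scoped by the query.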

We implemented this using a search with E as the criterion to find the
documents of interest and a terms stats facet to find the other entities:

{
  "query": {
    "nested": {
      "path": "entities",
      "query": {
        "term": {"entities.combined": "Rapp:PERSON"}
      }
    }
  },
  "facets": {
    "top_entities": {
      "terms_stats": {
        "key_field": "combined",
        "value_field": "frequency",
        "order": "total",
        "size": 10
      },
      "nested": "entities"
    }
  }
}

One of the entities returned by the top_entities facet is:

{
"term" : "Kevin:PERSON",
"count" : 3,
"total_count" : 3,
"min" : 1.0,
"max" : 5.0,
"total" : 10.0,
"mean" : 3.3333333333333335
}

We had reason to believe that the count of 3 was wrong (we already knew
that Rapp:PERSON and Kevin:PERSON occurred in 4 documents together).

We verified this by running a search for documents containing both those
entities:

{
  "fields": ["_id"],
  "query": {
    "bool": {
      "must": [
        {"nested": {
          "path": "entities",
          "query": {"term": {"entities.combined": "Rapp:PERSON"}}}},
        {"nested": {
          "path": "entities",
          "query": {"term": {"entities.combined": "Kevin:PERSON"}}}}
      ]
    }
  }
}

Which gave these results:

"hits" : {
  "total" : 4,
  "max_score" : 11.749819,
  "hits" : [ {
    "_index" : "documents",
    "_type" : "document",
    "_id" : "17592186137998",
    "_score" : 11.749819
  }, {
    "_index" : "documents",
    "_type" : "document",
    "_id" : "17592186138012",
    "_score" : 11.748099
  }, {
    "_index" : "documents",
    "_type" : "document",
    "_id" : "17592186138055",
    "_score" : 11.748099
  }, {
    "_index" : "documents",
    "_type" : "document",
    "_id" : "17592186138026",
    "_score" : 11.74794
  } ]
}

If we add the facet from above, the example Kevin:PERSON term shows up with
what we believe to be the correct values:

{
"term" : "Kevin:PERSON",
"count" : 4,
"total_count" : 4,
"min" : 1.0,
"max" : 5.0,
"total" : 14.0,
"mean" : 3.5
}
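Comparing this with the first facet result for Kevin:PERSON, the gap can be worked out directly. This is only a sketch using the numbers quoted in this message; the variable names are ours and purely illustrative.

```python
# Facet values copied from the two Kevin:PERSON results quoted in this message.
first = {"count": 3, "total": 10.0}   # facet on the original query
second = {"count": 4, "total": 14.0}  # facet on the query requiring both entities

# How many nested entity documents, and how much frequency, the first facet lacks.
missing_count = second["count"] - first["count"]
missing_total = second["total"] - first["total"]

# Means recomputed from count and total, to confirm each result is
# internally consistent.
first_mean = first["total"] / first["count"]
second_mean = second["total"] / second["count"]
```

So the first facet appears to be missing exactly one Kevin:PERSON entity, with a frequency of 4, which falls between the reported min of 1.0 and max of 5.0 and would explain why those two values are the same in both results.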

The facet calculation in the second search appears to include one more
document than in the first, even though we checked that all 4 of the hits
from the second query are also among the hits from the first query.

Out of curiosity we added a facet_filter to the original query (restricting
to just the Kevin:PERSON term):

{
  "query": {
    "nested": {
      "path": "entities",
      "query": {
        "term": {"entities.combined": "Rapp:PERSON"}
      }
    }
  },
  "facets": {
    "top_entities": {
      "facet_filter": {
        "term": {"combined": "Kevin:PERSON"}
      },
      "terms_stats": {
        "key_field": "combined",
        "value_field": "frequency",
        "order": "total",
        "size": 10
      },
      "nested": "entities"
    }
  }
}

and surprisingly (to us at least), it produced the correct facet value:

{
"term" : "Kevin:PERSON",
"count" : 4,
"total_count" : 4,
"min" : 1.0,
"max" : 5.0,
"total" : 14.0,
"mean" : 3.5
}

We are at a loss as to why our actual query seems to be missing a document
in the facet calculation.

If anyone could shed some light on this, it would be greatly appreciated.

Thanks,
Caleb
