Hi All,
I want to get a breakdown of docs by author (ordered by doc count per
author descending, limited to the top 10 authors).
I have 75 documents, each with an author_ids field that is an array of
IDs (strings).
There are 81 unique author IDs. Some docs have several authors. One
author has 2 docs, the rest have 1.
I have the default ES setup on Mac with 5 shards.
The first problem is that the top author (the one with 2 docs) isn't
always included in the top 10 with a count of 2.
I found this thread:
"Inconsistent facet count":
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/49dfd468528d8d64/a13923cba79e811b
If I've understood it correctly, it's because the ~15 docs on each
shard don't share an author with any other doc on the same shard, so
every author on a shard has a count of 1. Each shard then returns only
10 of them, chosen arbitrarily(?), and disregards the other ~5. The two
docs by the top author are on different shards, so their counts are
never added together when one or both of them fall into the ~5 being
disregarded on their shard.
Correct?
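To check that I'm describing the failure mode accurately, here is a small Python sketch of it. It is a toy model under my own assumptions, not Elasticsearch code: the data is made up (2 shards, 3 single-author docs each), the function names are hypothetical, and ties are broken alphabetically, which is an arbitrary choice and not a claim about what ES does.

```python
from collections import Counter

def shard_top_terms(docs, size):
    """Count author IDs on one shard and keep only the `size` most frequent.
    Ties are broken by term, an arbitrary choice made for this sketch."""
    counts = Counter(author for doc in docs for author in doc)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:size]

def merged_top_terms(shards, size):
    """Merge the truncated per-shard lists, then keep the overall top `size`."""
    total = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, size):
            total[term] += count
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:size]

def exact_top_terms(shards, size):
    """What I actually want: count over all docs, then truncate once."""
    all_docs = [doc for docs in shards for doc in docs]
    return shard_top_terms(all_docs, size)

# Hypothetical data: author "z" has one doc on each shard.
shards = [
    [["a"], ["b"], ["z"]],
    [["c"], ["d"], ["z"]],
]

print(merged_top_terms(shards, 2))  # "z" is cut on both shards
print(exact_top_terms(shards, 2))   # "z" comes first with a count of 2
```

In this model, asking each shard for 3 terms instead of 2 is enough for "z" to survive the per-shard cut and come out on top with a count of 2.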
If so, am I approaching my application the wrong way? It seems like
unless the field you want to count values of is the shard key (and
therefore things that would aggregate together are on the same shard),
then you are almost always going to have things missing from the final
counts. That would be okay for fuzzy scoring, but it seems unsuitable
for general aggregation.
I do want to limit to the top 10 authors, but I don't expect to have a
ton of data, and I really want everything to be counted. One solution
might be to have only a single shard. Bad idea?
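For concreteness, this is roughly what I mean by a single shard: a minimal sketch that only builds the index-creation body in Python. The helper name and the index name "docs" are made up for illustration; `number_of_shards` is the index setting I believe controls this, and as far as I know it has to be chosen when the index is created.

```python
import json

def single_shard_index_body():
    """Settings body for creating an index with one primary shard.
    With one shard there is no per-shard truncation to merge."""
    return {"settings": {"number_of_shards": 1}}

# This body would be sent when creating the (hypothetical) index "docs".
print(json.dumps(single_shard_index_body()))
```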
The second problem is that even after increasing the size (of terms for
the facet) to 81 and getting everything back with correct counts, the
(facets) result looks like this:
{
  "current_authors": {
    "_type": "terms",
    "missing": 44,
    "terms": [
      {
        "count": 2,
        "term": "4e04133210a4846701000061"
      },
      {
        "count": 1,
        "term": "4e04133210a484670100009c"
      },
      {
        "count": 1,
        "term": "4e04133210a484670100009b"
      },
      ...
    ]
  }
}
I don't know what the "missing" field means, and with all 81 terms
(author IDs) included, I can't imagine what the number 44 represents.
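For reference, the request is of roughly this shape, sketched here as a Python dict. The helper name is made up, the match_all query is an assumption, and I'm assuming the facet runs on the author_ids field described above; "current_authors" is the facet name from the result.

```python
import json

def top_authors_facet_body(size):
    """Search body with a terms facet returning up to `size` author IDs."""
    return {
        "query": {"match_all": {}},
        "facets": {
            "current_authors": {
                "terms": {"field": "author_ids", "size": size}
            }
        },
    }

# size=81 asks for every unique author ID instead of only the top 10.
print(json.dumps(top_authors_facet_body(81), indent=2))
```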
Cheers,
Chris