Facet counts and the "missing" field

Hi All,

I want to get a breakdown of docs by author (ordered by doc count per
author descending, limited to the top 10 authors).

I have 75 documents, each with an author_ids field that is an array of
IDs (strings).
There are 81 unique author IDs. Some docs have several authors. One
author has 2 docs, the rest have 1.
I have the default ES setup on Mac with 5 shards.

The first problem was that the top author wasn't always being included
in the top 10 with a count of 2 docs.

I found this thread:
"Inconsistent facet count":
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/49dfd468528d8d64/a13923cba79e811b
And if I've understood it correctly, it's because the ~15 docs on each
shard don't share an author with a doc on the same shard, so 10 are
being chosen randomly(?), and ~5 being disregarded. The two docs by
the top author are not finding each other in the aggregation because
they are on different shards and one or both of them are in the ~5
being disregarded on each shard.

Correct?
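
To make the effect described above concrete, here is a minimal sketch in plain Python. It is a simplified model of the per-shard truncation, not Elasticsearch's actual implementation: the function name `sharded_top_terms` and the sample data are hypothetical, and ties are broken by insertion order here, which is only an assumption standing in for whatever a real shard does.

```python
from collections import Counter


def sharded_top_terms(shards, size):
    """Merge per-shard top-`size` term counts (simplified model of a distributed terms facet)."""
    merged = Counter()
    for docs in shards:
        # Count every term of every doc that lives on this shard.
        local = Counter(term for doc in docs for term in doc)
        # Each shard reports only its own top `size` terms; the rest are dropped.
        for term, count in local.most_common(size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical data: author "a" has one doc on each of two shards.
shards = [
    [["x1"], ["x2"], ["a"]],
    [["y1"], ["y2"], ["a"]],
]

# With a small size, "a" is cut on both shards and never reaches the merge step.
truncated = dict(sharded_top_terms(shards, size=2))
# With a size large enough to cover every term, "a" is counted on both shards.
complete = dict(sharded_top_terms(shards, size=5))

print(truncated)
print(complete)
```

In this toy setup the author with two docs only shows up once the requested size is large enough for every shard to report all of its terms, which matches the behaviour described in the post.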

If so, am I approaching my application the wrong way? It seems like
unless the field you want to count values of is the shard key (and
therefore things that would aggregate together are on the same shard),
then you are almost always going to have things missing from the final
counts. That would be okay for fuzzy scoring, but it seems unsuitable
for general aggregation.

I do want to limit to the top 10 authors, but I don't expect to have a
ton of data, and I really want everything to be counted. One solution
might be to have only a single shard. Bad idea?

The second problem is that even after increasing the size (of terms for
the facet) to 81 and getting everything back with correct counts, the
(facets) result looks like this:
{
  "current_authors": {
    "_type": "terms",
    "missing": 44,
    "terms": [
      {
        "count": 2,
        "term": "4e04133210a4846701000061"
      },
      {
        "count": 1,
        "term": "4e04133210a484670100009c"
      },
      {
        "count": 1,
        "term": "4e04133210a484670100009b"
      },
      ...
    ]
  }
}

I don't know what the "missing" field means, and with all 81 terms
(author IDs) included, I can't imagine what the number 44 represents.

Cheers,
Chris

Update: tried with a single shard. Counts seem fine, but "missing"
went up to 96. That field is still very much a mystery to me.

Thanks in advance for any help on this!

Hey,

Yes, your logic about how the facets are computed is sound. "Missing" means docs that have no value for that field. If you think it's misbehaving, can you gist a recreation?

Very "long tail" values require either fetching all the terms (81 is a low number, so that's OK) or partitioning on a different value (rather than the doc ID), which is problematic in your case since a doc can have several author IDs.
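
As an illustration of that definition of "missing", here is a small sketch in plain Python. It does not call Elasticsearch; `terms_facet` is a hypothetical helper, the sample docs are made up, and treating an empty array the same as an absent field is an assumption of the sketch.

```python
from collections import Counter


def terms_facet(docs, field):
    """Count terms for `field` and the docs that have no value for it."""
    # Assumption: an absent field and an empty array both count as "no value".
    missing = sum(1 for doc in docs if not doc.get(field))
    counts = Counter(term for doc in docs for term in (doc.get(field) or []))
    return {"missing": missing, "terms": counts.most_common()}


# Hypothetical docs: two with authors, one with an empty array, one without the field.
docs = [
    {"author_ids": ["a", "b"]},
    {"author_ids": ["a"]},
    {"author_ids": []},
    {"title": "no authors field"},
]

result = terms_facet(docs, "author_ids")
print(result)
```

Under this definition "missing" can never exceed the number of docs, so a value of 96 for 75 docs would indeed look wrong.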

On Friday, June 24, 2011 at 1:55 PM, Chris Berkhout wrote:

Update: tried with a single shard. Counts seem fine, but "missing"
went up to 96. That field is still very much a mystery to me.

Thanks in advance for any help on this!

Hey Shay,

Thanks for the quick response!

It's making more sense now. I think single shard is the way to go for
now, then reassess when there's more data.
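
For reference, a sketch of what the single-shard setup and the facet request might look like, written as Python dicts that serialize to the JSON request bodies. The index name `docs` is hypothetical; the field `author_ids` and the facet name `current_authors` are the ones used in this thread, and the bodies follow the facets API as it is used here.

```python
import json

# Hypothetical index name, used only in the comments below.
INDEX = "docs"

# Body for creating the index with a single shard (e.g. PUT /docs).
create_index_body = {"settings": {"index": {"number_of_shards": 1}}}

# Body for a search with a terms facet on author_ids (e.g. POST /docs/_search).
# With one shard there is no cross-shard merge, so the top 10 is computed over all docs.
search_body = {
    "query": {"match_all": {}},
    "facets": {
        "current_authors": {
            "terms": {"field": "author_ids", "size": 10}
        }
    },
}

print(json.dumps(create_index_body))
print(json.dumps(search_body, indent=2))
```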

It looks like 'missing' is misbehaving, so I'll come back to you with
a gist recreation soon.

If I find the time I'll add some notes on distributed facet
calculation and 'missing' to the ES guide docs and send a pull
request.

Cheers,
Chris

On Fri, Jun 24, 2011 at 8:10 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Hey,
Yes, your logic about how the facets are computed is sound. "Missing"
means docs that have no value for that field. If you think it's misbehaving,
can you gist a recreation?
Very "long tail" values require either fetching all the terms (81 is a low
number, so that's OK) or partitioning on a different value (rather than the
doc ID), which is problematic in your case since a doc can have several
author IDs.


Here's a full gist recreation of the strange 'missing' number:

Cheers,
Chris

On Fri, Jun 24, 2011 at 9:41 PM, Chris Berkhout chrisberkhout@gmail.com wrote:

Hey Shay,

Thanks for the quick response!

It's making more sense now. I think single shard is the way to go for
now, then reassess when there's more data.

It looks like 'missing' is misbehaving, so I'll come back to you with
a gist recreation soon.

If I find the time I'll add some notes on distributed facet
calculation and 'missing' to the ES guide docs and send a pull
request.

Cheers,
Chris
