Thanks! That was definitely the right page in the documentation.
I wrote a query like this:
{
  "query": {
    "match_all" : { } // obviously would be something more interesting
  },
  "facets" : {
    "categoryId" : {
      "terms" : {
        "field" : "categoryId",
        "size" : 10000
      }
    }
  }
}
When I submit this using search_type=count, I can get back the top 10000
categoryIds, like this:
"facets" : {
  "categoryId" : {
    "_type" : "terms",
    "missing" : 0,
    "total" : 1295215,
    "other" : 301713,
    "terms" : [ {
      "term" : "person_256",
      "count" : 10753
    }, {
      "term" : "person_253",
      "count" : 8688
    }, {
      "term" : "person_3113",
      "count" : 7212
    }, {
      "term" : "person_288",
      "count" : 7082
    }, // etc
This answers my original question (b).
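For reference, here is a minimal sketch of how the request above could be assembled in Python. The host (http://localhost:9200), the index name (myindex), and the helper name build_count_request are all placeholders I made up for illustration; it only builds the URL and body and does not send anything.

```python
import json
from urllib.parse import urlencode


def build_count_request(base_url, index, field, size):
    """Return (url, body) for a terms-facet request that skips the hits.

    base_url and index are hypothetical placeholders, not values from
    the thread.
    """
    # search_type=count asks for facet results only, no documents.
    params = urlencode({"search_type": "count"})
    url = "%s/%s/_search?%s" % (base_url.rstrip("/"), index, params)
    body = {
        # match_all stands in for the real, more interesting query.
        "query": {"match_all": {}},
        "facets": {field: {"terms": {"field": field, "size": size}}},
    }
    return url, json.dumps(body)


url, body = build_count_request(
    "http://localhost:9200", "myindex", "categoryId", 10000
)
print(url)
print(body)
```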
I have a few more questions:
- How crazy can I go with the size parameter in the original facet
request? Can I just set it ridiculously high? The field is marked as
not_analyzed and guaranteed to be <20 bytes per document. I'm not exactly
doing this at Twitter scale, but I'd like to be able to run fewer than 10 such
queries at a time without the machines in the cluster running out
of memory.
- Is there some way I'm not seeing to solve my original question
(a)? I'd like to get just the number of distinct categoryId values without
having to count them on the client.
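To be concrete about what "counting on the client" means here, this is a minimal sketch in Python, assuming the response has already been parsed into a dict shaped like the facet output above. The function name summarize_terms_facet is hypothetical, and the sample is an abbreviated copy of the response with only the four terms shown. As far as I understand it, a non-zero "other" means some matching values did not fit within "size", so the count is only a lower bound in that case.

```python
def summarize_terms_facet(response, facet_name):
    """Count distinct terms in a terms-facet result and flag truncation."""
    facet = response["facets"][facet_name]
    # term -> count, i.e. the histogram from question (b)
    histogram = {entry["term"]: entry["count"] for entry in facet["terms"]}
    # Assumption: a non-zero "other" means some values fell outside "size",
    # so len(histogram) is only a lower bound on the distinct count.
    truncated = facet.get("other", 0) > 0
    return len(histogram), histogram, truncated


# Abbreviated copy of the response above (only the four terms shown there).
sample = {
    "facets": {
        "categoryId": {
            "_type": "terms",
            "missing": 0,
            "total": 1295215,
            "other": 301713,
            "terms": [
                {"term": "person_256", "count": 10753},
                {"term": "person_253", "count": 8688},
                {"term": "person_3113", "count": 7212},
                {"term": "person_288", "count": 7082},
            ],
        }
    }
}

distinct, histogram, truncated = summarize_terms_facet(sample, "categoryId")
print(distinct, truncated)
```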
Thanks!
On Monday, November 12, 2012 7:03:21 PM UTC-8, Igor Motov wrote:
I think what you are looking for is Terms Facet
(http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html).
On Monday, November 12, 2012 6:46:12 PM UTC-5, Ryan Noon wrote:
Hey folks,
I just started using elasticsearch a few months ago and I've been really
blown away by how fantastic the software and community are so far. I
could use a little help with a query, though.
I'm trying to write a query like this:
- Let's say each document in the corpus has a field called categoryId.
It's a string field (analyzed as a single token), and in a corpus with 10M
documents there are maybe 1M unique categories.
- Suppose a given query or filter (like: "documents created on June 30th,
2007") matches 100 documents. Clearly there are between 1 and 100 unique
categoryId values in this set of documents.
I'm not terribly interested in the 100 matching documents. What I'd like
to know is the easiest / most efficient way to get the system to return
answers for the following two questions:
a) How many distinct categoryId values are in the results for the query?
b) What is the actual set of unique categoryId values in the results for
the query? Bonus points for a histogram of the different categoryId values.
I've looked into some of the statistical facets and I haven't quite been
able to wrap my head around all of it. Is there something I'm missing? It
seems like such a query should be possible without me having to iterate
through all the results and build my own HashSet =).
Thanks again!
Ryan