Number of distinct values for a given field in a query?

Hey folks,

I just started using elasticsearch a few months ago and I've been really
blown away by how fantastic the software and community are so far. I
could use a little help with a query, though.

I'm trying to write a query like this:

  • Let's say each document in the corpus has a field called categoryId.
    It's a string field (analyzed as a single token), and in a corpus with 10M
    documents there are maybe 1M unique categories.

  • Suppose a given query or filter (like: "documents created on June 30th,
    2007") matches 100 documents. Clearly there are between 1 and 100 unique
    categoryId values in this set of documents.

I'm not terribly interested in the 100 matching documents. What I'd like
to know is the easiest / most efficient way to get the system to return
answers for the following two questions:
a) How many distinct categoryId values are in the results for the query?
b) What is the actual set of unique categoryId values in the results for
the query? Bonus points for a histogram of the different categoryId values.

I've looked into some of the statistical facets and I haven't quite been
able to wrap my head around all of it. Is there something I'm missing? It
seems like such a query should be possible without me having to iterate
through all the results and build my own HashSet =).

Thanks again!
Ryan

--

I think what you are looking for is Terms Facet:
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html

On Monday, November 12, 2012 6:46:12 PM UTC-5, Ryan Noon wrote:

--

Thanks! That was definitely the right page in the documentation.

I wrote a query like this:

{
  "query" : {
    "match_all" : { } // obviously would be something more interesting
  },
  "facets" : {
    "categoryId" : {
      "terms" : {
        "field" : "categoryId",
        "size" : 10000
      }
    }
  }
}
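
For reference, I'm submitting it roughly like this (index name is a
placeholder, with the JSON above saved as query.json):

curl -XGET 'localhost:9200/myindex/_search?search_type=count' -d @query.json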

When I submit this using search_type=count, I can get back the top 10000
categoryIds, like this:

"facets" : {
"categoryId" : {
"_type" : "terms",
"missing" : 0,
"total" : 1295215,
"other" : 301713,
"terms" : [ {
"term" : "person_256",
"count" : 10753
}, {
"term" : "person_253",
"count" : 8688
}, {
"term" : "person_3113",
"count" : 7212
}, {
"term" : "person_288",
"count" : 7082
}, // etc

This answers my original question (b).

I have a few more questions:

  1. How crazy can I go with the size parameter in the original facet
    request? Can I just set it ridiculously high? The field is marked as
    not_analyzed and guaranteed to be <20 bytes per document. I'm not exactly
    doing this at twitter scale, but I'd like to be able to run < 10 such
    queries at a time without the machines in the cluster running out of
    memory.
  2. Is there some way I'm not seeing to solve my original question (a)?
    I'd like to get just the number of distinct categoryId values without
    having to count them on the client.

Thanks!

On Monday, November 12, 2012 7:03:21 PM UTC-8, Igor Motov wrote:

--

  1. The terms facets are calculated on each shard. The top "size" facet
    entries are accumulated per shard and then sent to the requesting node,
    which "reduces" (merges) the individual shard results into one combined
    result. So a large size affects the query in two ways: memory and
    network traffic. It might work fine depending on the amount of memory
    your nodes have and your performance requirements. I would suggest
    giving it a try to see if it meets your needs.

  2. I don't think it's possible at the moment.

On Tuesday, November 13, 2012 4:14:29 AM UTC-5, Ryan Noon wrote:

--

Hello,

Just want to add a possible workaround: if you have a timestamp on
your documents, you can divide and conquer. For example, you can run
your facet over consecutive time intervals. Since the intervals will
probably share values, you can't just sum the per-interval counts.
Instead, you could index the unique values as document IDs in a
separate index/type - that index will end up containing only unique
values.

Then, counting the number of unique values is just a matter of
checking how many documents you have in your new index/type.
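
To make that concrete, here's a minimal sketch with made-up index/type
names. Each value is indexed under itself as the document ID, so
re-indexing a duplicate simply overwrites the existing document:

curl -XPUT 'localhost:9200/distinct_categories/category/person_256' -d '{
  "categoryId" : "person_256",
  "timestamp" : "2012-11-13T12:00:00"
}'

# the number of distinct values is then just the document count
curl -XGET 'localhost:9200/distinct_categories/category/_count'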

The nice part of such a workaround is that you can also use the
timestamp in your new index, so you can easily get the number of
unique values from the last X hours or so. Plus, updating that data
can be done via a cron job that would only facet on documents with a
timestamp newer than the last run (or newest unique ID).
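
The incremental update from the cron job could then look something like
this (again, names and the timestamp are made up) - facet only over
documents newer than the last run, and index whatever terms come back:

curl -XGET 'localhost:9200/myindex/_search?search_type=count' -d '{
  "query" : {
    "range" : { "timestamp" : { "gt" : "2012-11-13T00:00:00" } }
  },
  "facets" : {
    "categoryId" : {
      "terms" : { "field" : "categoryId", "size" : 10000 }
    }
  }
}'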

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Tue, Nov 13, 2012 at 1:41 PM, Igor Motov imotov@gmail.com wrote:

--

Thanks for the responses.

Radu: That's a really interesting workaround! Correct me if I'm wrong,
but wouldn't each query whose distinct categories I'm counting need its
own "distinct category" index, kept updated with the cron job? In my
situation I don't really know the queries ahead of time.

This slideshow talks about grouping features in Lucene 4, and I read a few
old emails on this list about this:

It seems like this might also have some potential (the TopGroups object has
a group count, etc). We'll see with the next ES release, but for now I
think I can make do with facets.

I appreciate the help!

On Tuesday, November 13, 2012 4:16:14 AM UTC-8, Radu Gheorghe wrote:

--

Hello Ryan,

On Tue, Nov 13, 2012 at 10:20 PM, Ryan Noon rmnoon@gmail.com wrote:


If I understand your question correctly, yes. You'd have to keep a
separate index/type for each field for which you count distinct
values.

For example, if your documents look like:
{"user": "john", "book": "war and peace"}

And you'd want to know distinct books and distinct users, you'd have
to maintain two types - say, "distinct_users" and "distinct_books".
Furthermore, if you want to know all the distinct words from book
titles ("war" and "peace" would be independent here), you'd need a
third type - say "distinct_booktitle_words".

On top of that, you'd have to maintain all of them separately, since
these are different queries/facets.
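
As a rough sketch (all names made up), that means one type per field,
with the value doubling as the document ID:

curl -XPUT 'localhost:9200/distinct/distinct_users/john' -d '{"value" : "john"}'
curl -XPUT 'localhost:9200/distinct/distinct_books/war%20and%20peace' -d '{"value" : "war and peace"}'

# one count per field you track
curl -XGET 'localhost:9200/distinct/distinct_users/_count'
curl -XGET 'localhost:9200/distinct/distinct_books/_count'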

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--