Number of distinct values for a given field in a query?

Ryan_Noon · November 12, 2012, 11:46pm

Hey folks,

I just started using elasticsearch a few months ago and I've been really
blown away with how fantastic the software and community is so far. I
could use a little help with a query, though.

I'm trying to write a query like this:

Let's say each document in the corpus has a field called categoryId.
It's a string field (analyzed as a single token), and in a corpus with 10M
documents there are maybe 1M unique categories.
Suppose a given query or filter (like: "documents created on June 30th,
2007") matches 100 documents. Clearly there are between 1 and 100 unique
categoryId values in this set of documents.

I'm not terribly interested in the 100 matching documents. What I'd like
to know is the easiest / most efficient way to get the system to return
answers for the following two questions:
a) How many distinct categoryId values are in the results for the query?
b) What is the actual set of unique categoryId values in the results for
the query? Bonus points for a histogram of the different categoryId values.

I've looked into some of the statistical facets and I haven't quite been
able to wrap my head around all of it. Is there something I'm missing? It
seems like such a query should be possible without me having to iterate
through all the results and build my own HashSet =).

Thanks again!
Ryan

--

Igor_Motov · November 13, 2012, 3:03am

I think what you are looking for is the Terms Facet:
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html

--

Ryan_Noon · November 13, 2012, 9:14am

Thanks! That was definitely the right page in the documentation.

I wrote a query like this:

{
  "query": {
    "match_all" : { } // obviously would be something more interesting
  },
  "facets" : {
    "categoryId" : {
      "terms" : {
        "field" : "categoryId",
        "size" : 10000
      }
    }
  }
}

When I submit this using search_type=count, I can get back the top 10000
categoryIds, like this:

"facets" : {
  "categoryId" : {
    "_type" : "terms",
    "missing" : 0,
    "total" : 1295215,
    "other" : 301713,
    "terms" : [ {
      "term" : "person_256",
      "count" : 10753
    }, {
      "term" : "person_253",
      "count" : 8688
    }, {
      "term" : "person_3113",
      "count" : 7212
    }, {
      "term" : "person_288",
      "count" : 7082
    }, // etc

This answers my original question (b).
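
For anyone reading along, here is a minimal Python sketch of turning a facet response like the one above into a histogram on the client. The function name summarize_terms_facet is made up for illustration, and the sample keeps only the four entries shown in this post. It also assumes that a non-zero "other" means some terms fell outside the requested size, so the number of returned terms is only a lower bound on the distinct count in that case.

```python
def summarize_terms_facet(facet):
    """Turn a terms-facet response into a histogram plus a completeness flag."""
    # Map each returned term to its document count
    histogram = {entry["term"]: entry["count"] for entry in facet["terms"]}
    return {
        "histogram": histogram,
        "distinct_returned": len(histogram),
        # Assumption: "other" > 0 means terms beyond the requested size were left out
        "complete": facet.get("other", 0) == 0,
    }


# Sample shaped like the response above, truncated to the four entries shown
sample = {
    "_type": "terms",
    "missing": 0,
    "total": 1295215,
    "other": 301713,
    "terms": [
        {"term": "person_256", "count": 10753},
        {"term": "person_253", "count": 8688},
        {"term": "person_3113", "count": 7212},
        {"term": "person_288", "count": 7082},
    ],
}

summary = summarize_terms_facet(sample)
print(summary["distinct_returned"], summary["complete"])
```

When "complete" comes back False, the returned list does not cover every term, so counting its entries would undercount the distinct values.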

I have a few more questions:

1) How crazy can I go with the size parameter in the original facet
request? Can I just set it ridiculously high? The field is marked as
not_analyzed and guaranteed to be <20 bytes per document. I'm not exactly
doing this at Twitter scale, but I'd like to be able to run < 10 such
queries at a time without the machines in the cluster running out
of memory.

2) Is there some way I'm not seeing to solve my original question
(a)? I'd like to get just the number of distinct categoryId values without
having to count them on the client.

Thanks!

--

Igor_Motov · November 13, 2012, 11:41am

On the size parameter: the terms facets are calculated on each shard. The
top "size" facet entries are accumulated per shard and then sent to the
requesting node, which "reduces" (merges) the individual shard results into
one common result. So a large size affects the query in two ways: memory
and network traffic. It might work fine depending on the amount of memory
that your nodes have and your performance requirements. I would suggest
giving it a try to see if it meets your needs.

On getting just the number of distinct values (your question (a)): I don't
think it's possible at the moment.
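
The per-shard collection and merge step described above can be sketched roughly as follows. This is a toy simulation in plain Python, not Elasticsearch's actual implementation: the function names shard_top_terms and reduce_shards and the sample data are made up for illustration. It only shows why the number of entries held in memory and sent over the network grows with the number of shards times size.

```python
from collections import Counter


def shard_top_terms(shard_values, size):
    """Count terms on one shard and keep only the `size` most frequent."""
    return Counter(shard_values).most_common(size)


def reduce_shards(per_shard_tops, size):
    """Merge the per-shard lists on the requesting node and keep the top `size`."""
    merged = Counter()
    for top in per_shard_tops:
        for term, count in top:
            merged[term] += count
    return merged.most_common(size)


# Three hypothetical shards, each holding the categoryId values of its documents
shards = [
    ["a", "a", "a", "b", "b", "c"],
    ["b", "b", "b", "a", "a", "d"],
    ["c", "c", "c", "a", "a"],
]
size = 2

per_shard = [shard_top_terms(values, size) for values in shards]
# Each shard ships up to `size` entries to the requesting node
entries_sent = sum(len(top) for top in per_shard)
result = reduce_shards(per_shard, size)
print(entries_sent, result)
```

With size set to 10000 instead of 2, each shard would build and ship up to 10000 entries per request, which is where the memory and network cost comes from.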

--

radu_gheorghe · November 13, 2012, 12:16pm

Hello,

Just want to add a possible workaround: at least if you have a
timestamp on your documents, you can divide and conquer. For example,
you can get consecutive intervals and do your facet on those. Since
you'd probably get duplicate values, you can't just sum them up.
Instead, you could index the unique values as IDs in a separate
index/type - that will end up containing only unique values.

Then, counting the number of unique values is just a matter of
checking how many documents you have in your new index/type.

The nice part of such a workaround is that you can also use the
timestamp in your new index, so you can easily get the number of
unique values from the last X hours or so. Plus, updating that data
can be done via a cron job that would only facet on documents with a
timestamp newer than the last run (or newest unique ID).
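
A rough sketch of that workaround, simulated in plain Python with a dict standing in for the separate index/type. Everything here is an assumption for illustration: the names facet_interval and update_distinct_index, the integer timestamps, and the last_seen field. A real version would run a terms facet per interval and index each value with the value itself as the document ID.

```python
def facet_interval(docs, start, end):
    """Stand-in for a terms facet restricted to start <= timestamp < end."""
    return {d["categoryId"] for d in docs if start <= d["timestamp"] < end}


def update_distinct_index(distinct_index, docs, last_run, now, step):
    """Walk consecutive intervals since last_run and upsert each value by ID."""
    start = last_run
    while start < now:
        end = min(start + step, now)
        for value in facet_interval(docs, start, end):
            # The value is the ID, so a repeat overwrites instead of adding a document
            distinct_index[value] = {"last_seen": start}
        start = end
    return now  # remember this as last_run for the next cron invocation


docs = [
    {"timestamp": 0, "categoryId": "person_1"},
    {"timestamp": 1, "categoryId": "person_2"},
    {"timestamp": 2, "categoryId": "person_1"},
    {"timestamp": 5, "categoryId": "person_3"},
]

distinct_index = {}
last_run = update_distinct_index(distinct_index, docs, last_run=0, now=4, step=2)
count_after_first_run = len(distinct_index)

# Next run only looks at documents newer than the previous run
last_run = update_distinct_index(distinct_index, docs, last_run=last_run, now=6, step=2)
count_after_second_run = len(distinct_index)

# Distinct values seen in intervals starting at time 2 or later
recent = sum(1 for entry in distinct_index.values() if entry["last_seen"] >= 2)
print(count_after_first_run, count_after_second_run, recent)
```

The distinct count is then just the number of entries in the side index, and the stored timestamp allows a "last X hours" count at the granularity of the interval size.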

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Ryan_Noon · November 13, 2012, 8:20pm

Thanks for the responses.

Radu: That's a really interesting workaround! Correct me if I'm wrong,
but wouldn't each query I'm counting the distinct categories on need to
have its own "distinct category" index that I'd keep updated with the cron
job? In my situation I don't really know the queries ahead of time.

This slideshow talks about grouping features in Lucene 4, and I read a few
old emails on this list about this:

It seems like this might also have some potential (the TopGroups object has
a group count, etc). We'll see with the next ES release, but for now I
think I can make do with using facets.

I appreciate the help!

--

radu_gheorghe · November 14, 2012, 9:50am

Hello Ryan,

On Tue, Nov 13, 2012 at 10:20 PM, Ryan Noon rmnoon@gmail.com wrote:

Thanks for the responses.

Radu: That's a really interesting workaround! Correct me if I'm wrong, but
wouldn't each query I'm counting the distinct categories on need to have its
own "distinct category" index that I'd keep updated with the cron job? In
my situation I don't really know the queries ahead of time.

If I understand your question correctly, yes. You'd have to keep a
separate index/type for each field for which you count distinct
values.

For example, if your documents look like:
{"user": "john", "book": "war and peace"}

And you'd want to know distinct books and distinct users, you'd have
to maintain two types - say, "distinct_users" and "distinct_books".
Furthermore, if you want to know all the distinct words from book
titles ("war" and "peace" would be independent here), you'd need a
third type - say "distinct_booktitle_words".

On top of that, you'd have to maintain all of them separately, since
these are different queries/facets.
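
To make the bookkeeping concrete, here is a small Python sketch that uses sets in place of the three types. The type names come from the example above; the function name update_distinct_types, the extra sample documents, and the whitespace split (standing in for whatever analyzer the title field uses) are assumptions for illustration.

```python
def update_distinct_types(doc, distinct):
    """Record one document's values in each per-field 'distinct' collection."""
    distinct["distinct_users"].add(doc["user"])
    distinct["distinct_books"].add(doc["book"])
    # Assumption: splitting on whitespace stands in for the field's analyzer
    distinct["distinct_booktitle_words"].update(doc["book"].split())


distinct = {
    "distinct_users": set(),
    "distinct_books": set(),
    "distinct_booktitle_words": set(),
}

docs = [
    {"user": "john", "book": "war and peace"},
    {"user": "mary", "book": "war and peace"},
    {"user": "john", "book": "peace talks"},
]
for doc in docs:
    update_distinct_types(doc, distinct)

counts = {name: len(values) for name, values in distinct.items()}
print(counts)
```

Each collection has to be updated separately for every new document, which is the maintenance cost mentioned above.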

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--
