Grouping entries together in a query, need some help with aggregations


(Vincent Massol) #1

Hi guys,

I'd like to count all entries in my ES instance, having a timestamp from
the last day and group together all entries having the same "instanceId".
With the data below, the count result should be 1 (and not 2) since 2
entries are within the last day but they have the same instanceId of "def".

I tried the following:

curl -XPOST
"http://localhost:9200/installs/install/_search?pretty=1&fields=_source,_timestamp"
-d'
{
"aggs": {
"lastday" : {
"filter" : {
"range" : {
"_timestamp" : {
"gt" : "now-1d"
}
}
},
"aggs" : {
"instanceids" : {
"terms" : { "field" : "instanceId" }
}
}
}
}
}'

But I have 3 problems with this:

  • It's not a count but a search. "aggs" don't seem to work with _count
  • It returns all entries in the result before the aggs data
  • In the aggs I don't get a direct count value and I have to count the
    number of buckets to get my answer

I'm pretty sure there's a simpler way but I'm having a hard time figuring
it out. Also could this query be expressed fully in the Query DSL?

Data:

curl -XDELETE "http://localhost:9200/installs"

curl -XPUT "http://localhost:9200/installs"

curl -XPUT "http://localhost:9200/installs/install/_mapping" -d'
{
"install" : {
"_timestamp" : {
"enabled" : true,
"store" : true
},
"properties" : {
"formatVersion" : { "type" : "string", "index" : "not_analyzed" },
"instanceId" : { "type" : "string", "index" : "not_analyzed" },
"distributionId" : { "type" : "string", "index" : "not_analyzed" },
"distributionVersion" : { "type" : "string", "index" : "not_analyzed"
}
}
}
}'

curl -XPOST "http://localhost:9200/installs/install?timestamp=2014-03-20"
-d'
{
"formatVersion" : "2.0",
"instanceId" : "abc",
"distributionId" : "org.xwiki.enterprise:xwiki-enterprise-web",
"distributionVersion" : "6.0-milestone-1"
}'

curl -XPOST "http://localhost:9200/installs/install" -d'
{
"formatVersion" : "2.0",
"instanceId" : "def",
"distributionId" : "org.xwiki.enterprise:xwiki-enterprise-web",
"distributionVersion" : "5.4.3"
}'

curl -XPOST "http://localhost:9200/installs/install" -d'
{
"formatVersion" : "2.0",
"instanceId" : "def",
"distributionId" : "org.xwiki.enterprise:xwiki-enterprise-web",
"distributionVersion" : "5.4.3"
}'

Thanks a lot for any help or pointers.
-Vincent

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/162805ff-5fa8-4a9a-9c77-a13922c09486%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

Hi Vincent,

I left some replies inline:

On Wed, Apr 2, 2014 at 10:02 AM, Vincent Massol vmassol@gmail.com wrote:

Hi guys,

I'd like to count all entries in my ES instance, having a timestamp from
the last day and group together all entries having the same
"instanceId"
. With the data below, the count result should be 1 (and not
2) since 2 entries are within the last day but they have the same
instanceId of "def".

I tried the following:

curl -XPOST "
http://localhost:9200/installs/install/_search?pretty=1&fields=_source,_timestamp"
-d'
{
"aggs": {
"lastday" : {
"filter" : {
"range" : {
"_timestamp" : {
"gt" : "now-1d"
}
}
},
"aggs" : {
"instanceids" : {
"terms" : { "field" : "instanceId" }
}
}
}
}
}'

But I have 3 problems with this:

  • It's not a count but a search. "aggs" don't seem to work with _count
  • It returns all entries in the result before the aggs data

For these two issues, you probably want to check out the count search
type[1] which works with aggregations. It's like a regular search, but
doesn't do perform the fetch phase in order to fetch the top hits.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#count

  • In the aggs I don't get a direct count value and I have to count the
    number of buckets to get my answer

We recently (Elasticsearch 1.1.0) added a cardinality[2] aggregation, that
allows for counting unique values. In previous versions of Elasticsearch,
counting was indeed only possible through the terms aggregation with a high
size parameter, but this was inefficient on high-cardinality fields.

[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#search-aggregations-metrics-cardinality-aggregation

Here is a gist that gives an example of the count search_type and the
cardinality aggregation:
https://gist.github.com/jpountz/9930690

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j4BaSGiyoNoSdu6qCxjjU4n1xCh3hT35cmcTGPmemcLtg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Vincent Massol) #3

Thanks a lot for your fast response Adrien!

  • I noticed the cardinality aggregation but I was worried by the "an
    approximate count of distinct values." part of the documentation. I need an
    exact value, not an approximate one :slight_smile: However I've read more the
    documentation and it may not be a real problem in practice, especially if I
    use a threshold of 40000 (the max apparently). I couldn't find the default
    precision value BTW in the documentation.
  • From your answer I gather that using aggregations is the only solution to
    my problem and there's no way to use the Query DSL to solve it.

Thanks, it helps a lot!
-Vincent

On Wednesday, April 2, 2014 11:17:17 AM UTC+2, Adrien Grand wrote:

Hi Vincent,

I left some replies inline:

On Wed, Apr 2, 2014 at 10:02 AM, Vincent Massol <vma...@gmail.com<javascript:>

wrote:

Hi guys,

I'd like to count all entries in my ES instance, having a timestamp from
the last day and group together all entries having the same
"instanceId"
. With the data below, the count result should be 1 (and
not 2) since 2 entries are within the last day but they have the same
instanceId of "def".

I tried the following:

curl -XPOST "
http://localhost:9200/installs/install/_search?pretty=1&fields=_source,_timestamp"
-d'
{
"aggs": {
"lastday" : {
"filter" : {
"range" : {
"_timestamp" : {
"gt" : "now-1d"
}
}
},
"aggs" : {
"instanceids" : {
"terms" : { "field" : "instanceId" }
}
}
}
}
}'

But I have 3 problems with this:

  • It's not a count but a search. "aggs" don't seem to work with _count
  • It returns all entries in the result before the aggs data

For these two issues, you probably want to check out the count search
type[1] which works with aggregations. It's like a regular search, but
doesn't do perform the fetch phase in order to fetch the top hits.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#count

  • In the aggs I don't get a direct count value and I have to count the
    number of buckets to get my answer

We recently (Elasticsearch 1.1.0) added a cardinality[2] aggregation, that
allows for counting unique values. In previous versions of Elasticsearch,
counting was indeed only possible through the terms aggregation with a high
size parameter, but this was inefficient on high-cardinality fields.

[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#search-aggregations-metrics-cardinality-aggregation

Here is a gist that gives an example of the count search_type and the
cardinality aggregation:
https://gist.github.com/jpountz/9930690

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2260e806-b42b-4936-a9ec-5079e691108f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Vincent Massol) #4

Actually I've just realized I'm going to hit a problem... I wanted to use
Kibana to graph this for me but I'm not sure Kibana supports
"aggregations"...

Any idea?

Thanks
-Vincent

On Wednesday, April 2, 2014 11:38:14 AM UTC+2, Vincent Massol wrote:

Thanks a lot for your fast response Adrien!

  • I noticed the cardinality aggregation but I was worried by the "an
    approximate count of distinct values." part of the documentation. I need an
    exact value, not an approximate one :slight_smile: However I've read more the
    documentation and it may not be a real problem in practice, especially if I
    use a threshold of 40000 (the max apparently). I couldn't find the default
    precision value BTW in the documentation.
  • From your answer I gather that using aggregations is the only solution
    to my problem and there's no way to use the Query DSL to solve it.

Thanks, it helps a lot!
-Vincent

On Wednesday, April 2, 2014 11:17:17 AM UTC+2, Adrien Grand wrote:

Hi Vincent,

I left some replies inline:

On Wed, Apr 2, 2014 at 10:02 AM, Vincent Massol vma...@gmail.com wrote:

Hi guys,

I'd like to count all entries in my ES instance, having a timestamp from
the last day and group together all entries having the same
"instanceId"
. With the data below, the count result should be 1 (and
not 2) since 2 entries are within the last day but they have the same
instanceId of "def".

I tried the following:

curl -XPOST "
http://localhost:9200/installs/install/_search?pretty=1&fields=_source,_timestamp"
-d'
{
"aggs": {
"lastday" : {
"filter" : {
"range" : {
"_timestamp" : {
"gt" : "now-1d"
}
}
},
"aggs" : {
"instanceids" : {
"terms" : { "field" : "instanceId" }
}
}
}
}
}'

But I have 3 problems with this:

  • It's not a count but a search. "aggs" don't seem to work with _count
  • It returns all entries in the result before the aggs data

For these two issues, you probably want to check out the count search
type[1] which works with aggregations. It's like a regular search, but
doesn't do perform the fetch phase in order to fetch the top hits.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#count

  • In the aggs I don't get a direct count value and I have to count the
    number of buckets to get my answer

We recently (Elasticsearch 1.1.0) added a cardinality[2] aggregation,
that allows for counting unique values. In previous versions of
Elasticsearch, counting was indeed only possible through the terms
aggregation with a high size parameter, but this was inefficient on
high-cardinality fields.

[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#search-aggregations-metrics-cardinality-aggregation

Here is a gist that gives an example of the count search_type and the
cardinality aggregation:
https://gist.github.com/jpountz/9930690

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0a0ba031-ab73-40d7-8397-dc536343ddf8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #5

I need an exact value, not an approximate one :slight_smile: However I've read more
the documentation and it may not be a real problem in practice, especially
if I use a threshold of 40000 (the max apparently). I couldn't find the
default precision value BTW in the documentation.

Do you really need an exact value? I mean that even if counts are
approximate, they tend to be precise and to have an error which is around
1%. But going from this 1% error margin to an accurate count would make the
aggregation MUCH more costly.

Counting high-cardinalities in a distributed system is actually a tough
issue, I tried to explain the reason why we chose an approximate algorithm
and don't provide an option to get accurate counts in our last blog post,
you might want to check it out:
http://www.elasticsearch.org/blog/count-elasticsearch/

We didn't document the default value because we try to have sensible
defaults depending on how your aggregation is structured. For example, if
you use a top-level aggregation, this is not an issue to setup a high
threshold since there would be a single counter, but it you use a
cardinality aggregation under a very large terms or histogram aggregation
this is a different story since there would be many more counters.

Actually I've just realized I'm going to hit a problem... I wanted to use
Kibana to graph this for me but I'm not sure Kibana supports
"aggregations"...

Indeed. There is significant work to be performed in order to integrate
aggregations into Kibana, so unfortunately this will probably take some
time to be implemented.

On Wed, Apr 2, 2014 at 6:43 PM, Vincent Massol vmassol@gmail.com wrote:

Actually I've just realized I'm going to hit a problem... I wanted to use
Kibana to graph this for me but I'm not sure Kibana supports
"aggregations"...

Any idea?

Thanks
-Vincent

On Wednesday, April 2, 2014 11:38:14 AM UTC+2, Vincent Massol wrote:

Thanks a lot for your fast response Adrien!

  • I noticed the cardinality aggregation but I was worried by the "an
    approximate count of distinct values." part of the documentation. I need an
    exact value, not an approximate one :slight_smile: However I've read more the
    documentation and it may not be a real problem in practice, especially if I
    use a threshold of 40000 (the max apparently). I couldn't find the default
    precision value BTW in the documentation.
  • From your answer I gather that using aggregations is the only solution
    to my problem and there's no way to use the Query DSL to solve it.

Thanks, it helps a lot!
-Vincent

On Wednesday, April 2, 2014 11:17:17 AM UTC+2, Adrien Grand wrote:

Hi Vincent,

I left some replies inline:

On Wed, Apr 2, 2014 at 10:02 AM, Vincent Massol vma...@gmail.comwrote:

Hi guys,

I'd like to count all entries in my ES instance, having a timestamp
from the last day and group together all entries having the same
"instanceId"
. With the data below, the count result should be 1 (and
not 2) since 2 entries are within the last day but they have the same
instanceId of "def".

I tried the following:

curl -XPOST "http://localhost:9200/installs/install/_search?
pretty=1&fields=_source,_timestamp" -d'
{
"aggs": {
"lastday" : {
"filter" : {
"range" : {
"_timestamp" : {
"gt" : "now-1d"
}
}
},
"aggs" : {
"instanceids" : {
"terms" : { "field" : "instanceId" }
}
}
}
}
}'

But I have 3 problems with this:

  • It's not a count but a search. "aggs" don't seem to work with _count
  • It returns all entries in the result before the aggs data

For these two issues, you probably want to check out the count search
type[1] which works with aggregations. It's like a regular search, but
doesn't do perform the fetch phase in order to fetch the top hits.

[1] http://www.elasticsearch.org/guide/en/elasticsearch/
reference/current/search-request-search-type.html#count

  • In the aggs I don't get a direct count value and I have to count the
    number of buckets to get my answer

We recently (Elasticsearch 1.1.0) added a cardinality[2] aggregation,
that allows for counting unique values. In previous versions of
Elasticsearch, counting was indeed only possible through the terms
aggregation with a high size parameter, but this was inefficient on
high-cardinality fields.

[2] http://www.elasticsearch.org/guide/en/elasticsearch/
reference/current/search-aggregations-metrics-
cardinality-aggregation.html#search-aggregations-metrics-
cardinality-aggregation

Here is a gist that gives an example of the count search_type and the
cardinality aggregation:
https://gist.github.com/jpountz/9930690

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/0a0ba031-ab73-40d7-8397-dc536343ddf8%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/0a0ba031-ab73-40d7-8397-dc536343ddf8%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j79jhvu-%3DbzakKYUiSiieUNWZuBWaiVZgFxnjWCYzLV6g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(system) #6