Sbuqueries and distinct counts


(Tim Uckun) #1

I want to do something like this.

select date_trunc('month', time_stamp), sum(distinct_count) from (
select date_trunc('week', time_stamp) as time_stamp, count(distinct
field_name) as distinct_count
from blah
group by date_trun('week', time_stamp)
)
group by date_trunc('month', time_stamp)

So basically I want to break up the data into weekly chunks and count the
distinct appearances of a value and then sum those up on a per monthly
basis.

In preparation for that I tried to do the subquery for that which looks
like this

aggregations: {
by_month: {
date_histogram: {
field: "time_stamp",
interval: "1M",
format: "yyyy-MM-dd HH:mm"
},
aggregations: {
by_node_mac: {
terms: {
field: "node_mac"
},
aggregations: {
cardinality: {field: 'device_mac'}
}

                          }
                      }
                  }
              }

}

but I seem to be getting the wrong answers. I am using fake data which
should give me very low numbers for the cardinality but it actually seems
to be counting the number of rows not the number of distinct items. The
numbers are outrageously high.

I tried a precision threshold of 1000 and 100 but it seems to make no
difference.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4cafbefc-bf00-49aa-9c7c-2240c4f1fd55%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Tim Uckun) #2

I hate to bump myself but does anybody have any input on this at all?

On Thursday, April 10, 2014 11:59:43 AM UTC+12, Tim Uckun wrote:

I want to do something like this.

select date_trunc('month', time_stamp), sum(distinct_count) from (
select date_trunc('week', time_stamp) as time_stamp, count(distinct
field_name) as distinct_count
from blah
group by date_trun('week', time_stamp)
)
group by date_trunc('month', time_stamp)

So basically I want to break up the data into weekly chunks and count the
distinct appearances of a value and then sum those up on a per monthly
basis.

In preparation for that I tried to do the subquery for that which looks
like this

aggregations: {
by_month: {
date_histogram: {
field: "time_stamp",
interval: "1M",
format: "yyyy-MM-dd HH:mm"
},
aggregations: {
by_node_mac: {
terms: {
field: "node_mac"
},
aggregations: {
cardinality: {field: 'device_mac'}
}

                          }
                      }
                  }
              }

}

but I seem to be getting the wrong answers. I am using fake data which
should give me very low numbers for the cardinality but it actually seems
to be counting the number of rows not the number of distinct items. The
numbers are outrageously high.

I tried a precision threshold of 1000 and 100 but it seems to make no
difference.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/016b94f6-4b62-4009-83ec-ccef3bfb17f6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Alexander Reelsen) #3

Hey,

you may want to provide a gist with some sample data, so people can
follow.. Also you might need to use the value_count aggregation somewhere,
to make sure you are only using the counts, but this really depends on your
data model and the data being indexed, so full gist might help a lot.

--Alex

On Fri, Apr 11, 2014 at 12:45 AM, Tim Uckun timuckun@gmail.com wrote:

I hate to bump myself but does anybody have any input on this at all?

On Thursday, April 10, 2014 11:59:43 AM UTC+12, Tim Uckun wrote:

I want to do something like this.

select date_trunc('month', time_stamp), sum(distinct_count) from (
select date_trunc('week', time_stamp) as time_stamp, count(distinct
field_name) as distinct_count
from blah
group by date_trun('week', time_stamp)
)
group by date_trunc('month', time_stamp)

So basically I want to break up the data into weekly chunks and count the
distinct appearances of a value and then sum those up on a per monthly
basis.

In preparation for that I tried to do the subquery for that which looks
like this

aggregations: {
by_month: {
date_histogram: {
field: "time_stamp",
interval: "1M",
format: "yyyy-MM-dd HH:mm"
},
aggregations: {
by_node_mac: {
terms: {
field: "node_mac"
},
aggregations: {
cardinality: {field: 'device_mac'}
}

                          }
                      }
                  }
              }

}

but I seem to be getting the wrong answers. I am using fake data which
should give me very low numbers for the cardinality but it actually seems
to be counting the number of rows not the number of distinct items. The
numbers are outrageously high.

I tried a precision threshold of 1000 and 100 but it seems to make no
difference.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/016b94f6-4b62-4009-83ec-ccef3bfb17f6%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/016b94f6-4b62-4009-83ec-ccef3bfb17f6%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGCwEM8JwkQHOsxUwt6ddH2W-f1nNcttUvGWq0AXZiFmVt9bSQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(system) #4