Strange behaviour when counting events in bar chart


#1

Hi
I want to create a vertical bar chart visualisation that counts the number of events by event timestamp, and I notice a behaviour that I do not understand.

My event time field is called ACT_HEURE_ACTIVITE.
My time range spans from the 2nd of May 00:00:00 to the 3rd of May 00:00:00.

When I set my Y-axis aggregation to Count, Kibana returns 102 events. I get the same result if I change it to Unique Count on the INC_ID field (which is a unique ID).

When I leave my Y-axis aggregation on Count and set my X-axis to Date Histogram on ACT_HEURE_ACTIVITE with a daily interval, Kibana still returns 102 events.

But when I change my Y-axis aggregation to Unique Count on the INC_ID field, Kibana returns 100 events...

And when I change the Y-axis field to ACT_HEURE_ACTIVITE, Kibana returns 103 events!

Could anyone explain this behaviour?


(Lee Drengenberg) #2

What count does the Discover tab show for that time range? I think that would be the most accurate count of docs since it isn't doing any aggregations.


#3

102 events


(Lee Drengenberg) #4

So it seems you must have 102 documents in that time range.

But it seems that either 3 documents have the same INC_ID, or 2 pairs of documents have the same INC_ID, such that there are only 100 unique INC_IDs. On the Discover tab you could add the INC_ID field to the table and sort by it to try to see the duplicates.
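
A minimal sketch of that duplicate check outside Kibana, assuming the INC_ID values have been exported from Discover into a plain Python list. The sample IDs below are invented for illustration and are not taken from the real index:

```python
from collections import Counter


def find_duplicates(ids):
    """Return {id: occurrences} for every ID that appears more than once."""
    counts = Counter(ids)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample: 6 documents, 2 of which reuse an existing INC_ID.
inc_ids = ["A1", "A2", "A3", "A2", "A4", "A1"]
duplicates = find_duplicates(inc_ids)

print("documents:", len(inc_ids))
print("unique IDs:", len(set(inc_ids)))
print("duplicated IDs:", duplicates)
```

If the document count and the number of unique IDs match, as they would for a truly unique field, `find_duplicates` returns an empty dict and the gap has to come from somewhere else.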

"And I change the Y-axis field to ACT_HEURE_ACTIVITE Kibana returns 103 events !"

You don't show a screenshot of this last combination. I don't know why you would see a count of 103. But I can't think of any reason to use the timestamp on both the X and Y axes.

Another way to find the duplicates might be to follow the example in this screenshot and put JSON in the advanced JSON Input field (except you would use { "min_doc_count": 2 }). That should show you only the duplicate INC_IDs.
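
The same idea can be sketched as a raw search body: a terms aggregation on INC_ID with `min_doc_count` set to 2 only returns buckets for values shared by at least two documents. The aggregation name `duplicate_ids` and the bucket `size` below are assumptions chosen for illustration; the body would be sent to the index's `_search` endpoint with whatever client is at hand.

```python
import json


def duplicate_terms_query(field, size=1000):
    """Build a search body listing only values of `field` found in 2+ documents."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "duplicate_ids": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "min_doc_count": 2,  # drop buckets with a single document
                    "size": size,  # assumed upper bound on returned buckets
                }
            }
        },
    }


body = duplicate_terms_query("INC_ID")
print(json.dumps(body, indent=2))
```

An empty bucket list in the response would mean no INC_ID is shared between documents.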


#5

I added the INC_ID field in the Discover tab and exported the data to Excel (copy-paste values). My sheet contains 102 lines. When I remove duplicates on the INC_ID column I still get 102 distinct values.

This is the result:


Indeed, there is no reason to use the timestamp on the Y-axis as well. My goal was to use another unique field to check the aggregation function.

I also tried to follow your example, but I think I am doing something wrong because I don't get the expected result.

Any idea?


(Lee Drengenberg) #6

I think for the min_doc_count part you should try changing "Date Histogram" to "Terms" and changing "Field" to INC_ID.


#7

I set up my visualization as suggested and got a "No results found" message.


#8

I set up a new basic test environment, using the following scripts to create the index and mapping and to import the data.


The data file contains 324 events, all of which occurred on the 2nd of May 2016.

As you can see my new index contains 324 documents.

The mapping has following structure:

Events are displayed in Discover tab:

I get 324 results when I aggregate my data with a date histogram on the ACT_ACTIVITY_TIME field.

But if I set my metric aggregation type to Unique Count, I get 335 results...


Note that I get different results depending on the field I choose.


(Lee Drengenberg) #9

Your last screenshot has a time range of 'Last 60 days'. I think you meant for that to be only the 2nd of May 2016. Do you get 324 if you set it to only that day?


#10

I still get 335 results if I set the time range from the 2nd to the 3rd of May and choose a daily custom interval.

In my previous post I didn't set the time range because the Elasticsearch index only contains events that occurred on the 2nd of May.


#11

Any idea?


(Lee Drengenberg) #12

I started looking into this again. One thing to look at is this:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate

I don't see precision_threshold specified in the request for a visualization that uses Unique Count, so I guess it uses the default. The default value is described there, but the description doesn't make complete sense to me:

Default value depends on the number of parent aggregations that multiple create buckets (such as terms or histograms).
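
For reference, a hedged sketch of a cardinality aggregation body with `precision_threshold` set explicitly rather than left at the default. The aggregation name `unique_values` is an assumption; 40000 is the maximum supported value according to the documentation linked above, and counts are still approximate above the chosen threshold.

```python
import json


def unique_count_query(field, precision_threshold=40000):
    """Build a search body that counts distinct values of `field`."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "unique_values": {  # hypothetical aggregation name
                "cardinality": {
                    "field": field,
                    # below this many distinct values, counts are expected
                    # to be close to accurate
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


print(json.dumps(unique_count_query("INC_ID"), indent=2))
```

Running this against the test index and comparing the result with the document count would show whether the approximation accounts for the gap.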

I also ran a test with some test data I had, and when I charted the unique count of a time field I also got many more results than I expected. But then I found that the time field was an array, and some docs had multiple time values in them.
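
To illustrate that last point, here is a small sketch showing how a multi-valued field can make the number of distinct values exceed the number of documents. The documents and field names are invented for illustration:

```python
def doc_count_and_unique_values(docs, field):
    """Return (number of docs, number of distinct values of `field`).

    The field may hold a single value or a list of values.
    """
    distinct = set()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        distinct.update(values)
    return len(docs), len(distinct)


# Hypothetical documents: the second one carries two timestamps.
docs = [
    {"id": 1, "time": "2016-05-02T08:00:00"},
    {"id": 2, "time": ["2016-05-02T09:00:00", "2016-05-02T09:30:00"]},
    {"id": 3, "time": "2016-05-02T10:00:00"},
]

n_docs, n_unique = doc_count_and_unique_values(docs, "time")
print(n_docs, n_unique)  # 3 documents, 4 distinct time values
```

A unique count above the document count is therefore possible without any approximation error, as long as some documents hold more than one value in the counted field.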

