Strange behaviour when counting events in bar chart


#1

Hi
I want to create a vertical bar chart visualisation that counts the number of events by event timestamp, and I noticed a behaviour that I do not understand.

My event time field is called ACT_HEURE_ACTIVITE.
My time range spans from the 2nd of May 00:00:00 to the 3rd of May 00:00:00.

When I set my Y-axis aggregation to Count, Kibana returns 102 events. I get the same result if I change it to Unique Count on the INC_ID field (which is a unique ID).

When I leave my Y-axis aggregation on Count and set my X-axis to Date Histogram on ACT_HEURE_ACTIVITE with a daily interval, Kibana still returns 102 events.

But when I change my Y-axis aggregation to Unique Count on the INC_ID field, the system returns 100 events...

And when I change the Y-axis field to ACT_HEURE_ACTIVITE, Kibana returns 103 events!

Could anyone explain this behaviour?


(Lee Drengenberg) #2

What count does the Discover tab show for that time range? I think that would be the most accurate count of docs since it isn't doing any aggregations.


#3

102 events


(Lee Drengenberg) #4

So it seems you must have 102 documents in that time range.

But it seems that either 3 documents share the same INC_ID, or 2 pairs of documents each share an INC_ID, so that there are only 100 unique INC_IDs. On the Discover tab you could add the INC_ID field to the table and sort by it to try to spot the duplicates.
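The same duplicate check can be scripted outside Kibana. This is only a minimal sketch: the sample IDs are made up, and `find_duplicates` is a hypothetical helper, not part of Kibana or Elasticsearch.

```python
from collections import Counter


def find_duplicates(values):
    """Return {value: occurrences} for every value that appears more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample; replace it with the INC_ID column taken from Discover.
inc_ids = ["A1", "A2", "A3", "A2", "A4"]

print("documents:", len(inc_ids))
print("unique INC_IDs:", len(set(inc_ids)))
print("duplicates:", find_duplicates(inc_ids))
```

If the number of documents and the number of unique IDs match, the duplicates dictionary is empty and the lower Unique Count has to come from somewhere else.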

"And when I change the Y-axis field to ACT_HEURE_ACTIVITE, Kibana returns 103 events!"

You don't show a screenshot of this last combination. I don't know why you would see a count of 103, but I can't think of any reason to use the timestamp on both the X and Y axis.

Another way to find the duplicates might be to follow the example in this screenshot and put JSON in the advanced JSON Input field (except you would use { "min_doc_count": 2 }), and that should show you only the duplicate INC_IDs.
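For reference, the equivalent request sent directly to Elasticsearch might look roughly like the sketch below. It only builds and prints the search body. The helper name, the `size` value, and the year in the dates are assumptions, and the body would still have to be posted to the `_search` endpoint of the relevant index.

```python
import json


def duplicate_terms_query(field, time_field, start, end, size=1000):
    """Build a search body whose terms aggregation keeps only values found in 2+ documents."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "query": {"range": {time_field: {"gte": start, "lt": end}}},
        "aggs": {
            "duplicates": {
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


# Field names come from the thread; the year in the dates is an assumption.
body = duplicate_terms_query(
    field="INC_ID",
    time_field="ACT_HEURE_ACTIVITE",
    start="2016-05-02T00:00:00",
    end="2016-05-03T00:00:00",
)
print(json.dumps(body, indent=2))
```

An empty bucket list in the response would mean that no INC_ID occurs in more than one document.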


#5

I added the INC_ID field in the Discover tab and exported the data to Excel (copy-paste values). My sheet contains 102 lines. When I remove duplicates on the INC_ID column, I still obtain 102 distinct values.

This is the result:


Indeed, there is no reason to use the timestamp on the Y-axis as well. My goal was to use another unique field and check the aggregation function.

I also tried to follow your example, but I think I am doing something wrong because I don't obtain the expected result.

Any idea?


(Lee Drengenberg) #6

I think on the min_doc_count part you should try changing "Date Histogram" to "Terms" and changing "Field" to INC_ID.


#7

I set up my visualization as suggested and obtained a "No results found" message.


#8

I set up a new basic test environment using the following scripts to create the index and mapping, and to import the data.
https://www.wetransfer.com/downloads/0e2cb9930e37a56a7504ba70160e295d20160606123941/9e3c84
The data file contains 324 events, all of which occurred on the 2nd of May 2016.

As you can see, my new index contains 324 documents.

The mapping has the following structure:

Events are displayed in the Discover tab:

I obtain 324 results when I aggregate my data by date histogram on the ACT_ACTIVITY_TIME field.

But if I set my metric aggregation type to Unique Count, I obtain 335 results...


Note that I obtain different results depending on the field I choose.


(Lee Drengenberg) #9

Your last screenshot has a time range of 'Last 60 days'. I think you meant for that to be only the 2nd of May 2016. Do you get 324 if you set it to only that day?


#10

I still obtain 335 results if I set the time range from the 2nd to the 3rd of May and choose a daily custom interval.

In my previous post I didn't set the time range because the Elasticsearch index only contains events that occurred on the 2nd of May.


#11

Any idea?


(Lee Drengenberg) #12

I started looking into this again. One thing to look at is this:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate

I don't see precision_threshold specified in the request for a visualization that uses unique count, so I guess it uses the default. The default value is described here, but it doesn't make complete sense to me:

Default value depends on the number of parent aggregations that multiple create buckets (such as terms or histograms).
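To rule out the approximation, one option is to set precision_threshold explicitly. The sketch below only builds the request body for a daily unique count. The helper name and the threshold of 1000 are assumptions, chosen only to sit above the 324 documents mentioned in the thread.

```python
import json


def unique_count_per_day(field, time_field, precision_threshold=None):
    """Build a date_histogram body with a cardinality sub-aggregation on `field`."""
    cardinality = {"field": field}
    if precision_threshold is not None:
        # Below this threshold, counts are expected to be close to accurate.
        cardinality["precision_threshold"] = precision_threshold
    return {
        "size": 0,
        "aggs": {
            "per_day": {
                # "interval" matches the Elasticsearch versions of that time;
                # recent versions use "calendar_interval" instead.
                "date_histogram": {"field": time_field, "interval": "day"},
                "aggs": {"unique_ids": {"cardinality": cardinality}},
            }
        },
    }


default_body = unique_count_per_day("INC_ID", "ACT_ACTIVITY_TIME")
tuned_body = unique_count_per_day("INC_ID", "ACT_ACTIVITY_TIME", precision_threshold=1000)
print(json.dumps(tuned_body, indent=2))
```

In Kibana the same setting should, in principle, be reachable by putting { "precision_threshold": 1000 } in the JSON Input field of the Unique Count metric.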

I also tried a test with test data I had, and when I charted unique count of a time field I also got many more results than I expected. But then I found that the time field represented an array and some docs had multiple time values in them.
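The effect of array values can be reproduced without Elasticsearch. In this minimal sketch the documents and the field name are made up: three documents hold four distinct time values, so an exact unique count on that field is higher than the document count.

```python
def doc_count(documents):
    """Number of documents, which is what the Count metric reports."""
    return len(documents)


def unique_value_count(documents, field):
    """Number of distinct values of `field`, counting every element of array values."""
    values = set()
    for doc in documents:
        raw = doc.get(field, [])
        # A field may hold a single value or an array of values.
        values.update(raw if isinstance(raw, list) else [raw])
    return len(values)


# Hypothetical documents; the second one carries two time values.
docs = [
    {"id": 1, "time": "2016-05-02T08:00:00"},
    {"id": 2, "time": ["2016-05-02T09:00:00", "2016-05-02T09:30:00"]},
    {"id": 3, "time": "2016-05-02T10:00:00"},
]

print("count:", doc_count(docs))
print("unique count of time:", unique_value_count(docs, "time"))
```

Whether this applies to the data in this thread depends on whether any of the charted fields contain arrays.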


(system) closed #13