I ingested data from SQL Server into Elasticsearch. The total number of documents is the same, but when I create a visualization in Kibana the values are different, for example when I do a distinct count by product. I checked the numbers in Tableau and SQL Server, and the Kibana data doesn't make sense. Does anyone know why?
Can you share your documents, the visualization and query, and the problematic results?
I believe that the difference you are seeing is because of the way Elasticsearch works. Our SQL access for use by clients like Tableau is meant for access to individual documents, which is inherently slower, but is also completely accurate. The screenshot you showed from Lens is using the Cardinality aggregation in Elasticsearch, which has a default "precision threshold" of 3000. Since your value is above 3000, it is an estimate.
This behavior is documented: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
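To make this concrete, here is roughly the kind of request a distinct count in Lens translates to. This is a sketch, not the exact request Kibana generates; my_index is a placeholder, and the field name is taken from later in this thread:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "dc_produto.keyword"
      }
    }
  }
}

With no precision_threshold set, the aggregation defaults to 3000: counts up to roughly that value are near-exact, and anything above it is a HyperLogLog++ estimate.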
Despite the documented behavior, I do agree that this can be confusing. Because I am one of the engineers working on Lens, I will write an issue for us to add the precision threshold flag.
So it is not possible to get the precise value above 3000 using Lens or any Kibana visualization?
Could this be because I ingested more than 100,000 documents at once? Could you clarify this for me?
I believe that you can do this same visualization with a higher precision threshold using the Kibana "Metric" visualization. You will need to type this into the "Advanced JSON Input" section:
{ "precision_threshold": 4000 }
Okay, there are two things that I'd like you to check:

- Can you set the threshold to 40000, which is the max value?
- Is there a difference in your mapping between the fields dc_produto.keyword and dc_produto? (See the sketch after this list for a quick way to check.)
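For the second point, one quick way to compare the two fields is the field mapping API; my_index is a placeholder for your index name:

GET my_index/_mapping/field/dc_produto,dc_produto.keyword

In a typical setup, dc_produto would come back as type "text" with dc_produto.keyword as a "keyword" sub-field; anything different in your mapping would be worth a closer look.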
Thanks for helping me.
Can you set the threshold to 40000, which is the max value?
I changed it to 40000, but the result was only 5 less; nothing close to the correct value.
Is there a difference in your mapping between the fields dc_produto.keyword and dc_produto?
No! They are the same, but "keyword" is the field that is shown in the Kibana view.
This indicates that there are probably documents that you ingested without a timestamp. Since you have a time field on your index pattern, we are generating a time range query on that field: this will automatically exclude any documents without a time field. My advice is to delete the index pattern and create a new one without a time field: this will prevent the time filter from being added.
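If you want to confirm this before recreating the index pattern, you can count the documents that are missing the time field. This is a sketch: my_index and my_time_field are placeholders for your index and whatever time field the index pattern uses:

GET my_index/_count
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "my_time_field" } }
    }
  }
}

A non-zero count here would mean some documents are being excluded by the time filter.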
40000 is the maximum threshold, per the docs.
I made the change as in the screenshot. Do you think there is something wrong with this image?
All the other places use the same "dc_produto" column, and the value is 4052.
Given that all of the cardinality estimates are in the same range, it seems like there really is a difference in the data between what the SQL interface and the aggregation interface are using. It's unclear what the cause could be. I think you need to debug this by looking at what Elasticsearch has actually indexed for dc_produto.keyword, using a documents query with docvalue_fields: ["dc_produto.keyword"]. This could show you if the full-text field contains different data than the indexed field.
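A minimal sketch of that query, with my_index as a placeholder and the size up to you:

GET my_index/_search
{
  "size": 20,
  "_source": ["dc_produto"],
  "docvalue_fields": ["dc_produto.keyword"]
}

Each hit will then include both the original _source value and a fields section with the indexed doc value, so you can compare them directly and spot any mismatches.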
I noticed that the indexing completed, but a lot of the data was divergent. When I ingest a much smaller volume, it comes through correctly. Would you have any idea why?
It sounds like you've identified the likely cause of the problem, which is about as much as I can help with. Figuring out the problems with your data ingestion setup requires a lot more knowledge about those components of the system, which I don't have. You can talk to our Elastic support engineers, or ask in one of our other forums dedicated to these kinds of Elasticsearch configuration issues.