I ingested data from SQL Server into Elasticsearch. The total number of documents is the same, but when I create a visualization in Kibana the values are different, for example when I do a distinct count by product. I checked the numbers in Tableau and SQL Server, and the Kibana data doesn't make sense. Does anyone know why?
Can you share your documents, the visualization and query, and the problematic results?
I believe that the difference you are seeing is because of the way Elasticsearch works. Our SQL access for use by clients like Tableau is meant for access to individual documents, which is inherently slower, but is also completely accurate. The screenshot you showed from Lens is using the Cardinality aggregation in Elasticsearch, which has a default "precision threshold" of 3000. Since your value is above 3000, it is an estimate.
This behavior is documented: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
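To make this concrete, here is roughly the kind of request a distinct count in Lens translates to. This is a sketch, not the exact request Kibana generates; my_index is a placeholder, and the field name is taken from later in this thread:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "dc_produto.keyword"
      }
    }
  }
}

With no precision_threshold set, the aggregation defaults to 3000: counts up to roughly that value are near-exact, and anything above it is a HyperLogLog++ estimate.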
Despite the documented behavior, I do agree that this can be confusing. Because I am one of the engineers working on Lens, I will write an issue for us to add the precision threshold flag.
So it is not possible to get the precise value above 3000 using Lens or any Kibana visualization?
Could this be because I ingested more than 100,000 documents at once? Could you clarify this for me?
I believe that you can do this same visualization with a higher precision threshold using the Kibana "Metric" visualization. You will need to type this into the "Advanced JSON Input" section:
{ "precision_threshold": 4000 }
Okay, there are two things that I'd like you to check:

- Can you set the threshold to 40000, which is the max value?
- Is there a difference in your mapping between the fields dc_produto.keyword and dc_produto? (See the sketch after this list for a quick way to check.)
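For the second point, one quick way to compare the two fields is the field mapping API; my_index is a placeholder for your index name:

GET my_index/_mapping/field/dc_produto,dc_produto.keyword

In a typical setup, dc_produto would come back as type "text" with dc_produto.keyword as a "keyword" sub-field; anything different in your mapping would be worth a closer look.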
Thanks for helping me.
Can you set the threshold to 40000, which is the max value?
I changed it to 40000, but the result was only 5 less; nothing close to the correct value.
Is there a difference in your mapping between the fields dc_produto.keyword and dc_produto?
No! They are the same, but "keyword" is the field that is shown in the Kibana view.
This indicates that there are probably documents that you ingested without a timestamp. Since you have a time field on your index pattern, we are generating a time range query on that field: this will automatically exclude any documents without a time field. My advice is to delete the index pattern and create a new one without a time field: this will prevent the time filter from being added.
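If you want to confirm this before recreating the index pattern, you can count the documents that are missing the time field. This is a sketch: my_index and my_time_field are placeholders for your index and whatever time field the index pattern uses:

GET my_index/_count
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "my_time_field" } }
    }
  }
}

A non-zero count here would mean some documents are being excluded by the time filter.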
40000 is the maximum threshold, per the docs.
I made the change as in the screenshot. Do you think there is something wrong with this image?
All the other places use the same "dc_produto" column, and the value is 4052.
Given that all of the cardinality estimates are in the same range, it seems like there really is a difference in the data between what the SQL interface and the aggregation interface are using. It's unclear what the cause could be. I think you need to debug this by looking at what Elasticsearch has actually indexed for dc_produto.keyword, using a documents query with docvalue_fields: ["dc_produto.keyword"]. This could show you if the full-text field contains different data than the indexed field.
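A minimal sketch of that query, with my_index as a placeholder and the size up to you:

GET my_index/_search
{
  "size": 20,
  "_source": ["dc_produto"],
  "docvalue_fields": ["dc_produto.keyword"]
}

Each hit will then include both the original _source value and a fields section with the indexed doc value, so you can compare them directly and spot any mismatches.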
I noticed that the indexing completed, but a lot of the data was divergent. When I ingest a much smaller volume, it comes through correctly. Would you have any idea why?
It sounds like you've identified the likely cause of the problem, which is about as much as I can help with. Figuring out the problems with your data ingestion setup requires a lot more knowledge about those components of the system, which I don't have. You can talk to our Elastic support engineers, or ask in one of our other forums dedicated to these kinds of Elasticsearch configuration issues.