Scenario:
Indices: 2019-05*, 2019-06*, 2019-07*, 2019-08*, 2019-09*
I have an ID field in all of these indices, each of which contains millions of documents.
I want to analyse how many of the IDs that appear in 2019-05* are repeated in the other indices such as 2019-06*, 2019-07*, and so on.
Before asking, I searched Google and the forums and came up with ideas like aggregations, scroll, and the composite aggregation, but none of them directly solve the problem I'm looking at. I'm migrating from SQL to ES, so I may not be describing the exact problem well, which is why I gave the scenario.
Thanks for understanding, and thanks for your support.
You can use the cardinality aggregation on the _index field to count the number of indices each ID appears in. I've used the terms aggregation in this example, but if you have large numbers of IDs then swap this for the composite aggregation and use the after parameter to page through the results.
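A minimal sketch of what such a query might look like, assuming the indices match 2019-* and the ID field is simply called id (adjust both names to your mapping):

# For each id, count how many different indices it occurs in
GET 2019-*/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": { "field": "id", "size": 10000 },
      "aggs": {
        "index_count": {
          "cardinality": { "field": "_index" }
        }
      }
    }
  }
}

Any bucket whose index_count is greater than 1 is an ID that appears in more than one monthly index.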
Thanks a lot, that helps. One follow-up though: this gives the individual index count for each ID, which is a bit expensive given my data size.
My need is just the total count. For example, index 2019-05* has ID values [1, 2, 3, 4, 5, 6]
and index 2019-06* has ID values [4, 5, 6].
The IDs present in both indices are [4, 5, 6], so the matching count is 3.
I just need the result 3. Maybe it is the intersection count of the ID term across both indices, but I don't know how to express that in Elasticsearch. Thank you.
The challenge here is that you can't set the size on the terms agg to a massive number because you'll run out of memory. Instead you could use the composite agg and repeatedly call it with an after clause, but it only sorts by the ID key (not by the cardinality agg we are using as a child). If the majority of IDs appear in only one index this could be wasteful, because your client app would have to throw away all the results with low index cardinality.
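A sketch of that paging approach, again assuming a field called id; the index pattern and page size are only placeholders:

# Page through every id, computing how many indices each one appears in
GET 2019-*/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "composite": {
        "size": 1000,
        "sources": [
          { "id": { "terms": { "field": "id" } } }
        ]
      },
      "aggs": {
        "index_count": {
          "cardinality": { "field": "_index" }
        }
      }
    }
  }
}

Each response includes an after_key; copy it into an "after" clause inside the composite block on the next call to fetch the following page.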
There is a way to "page" through my example terms aggregation while maintaining the sort by index popularity. You can do this by breaking the IDs into N arbitrary groups using the terms partitioning feature, as sketched below.
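A sketch of the partitioning approach under the same assumed id field. Here the IDs are split into 20 arbitrary partitions and each request handles one partition, with buckets ordered by the child cardinality agg so the IDs seen in the most indices come back first:

# Process partition 0 of 20; repeat with partition 1, 2, ... 19 to cover all ids
GET 2019-*/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": {
        "field": "id",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 10000,
        "order": { "index_count": "desc" }
      },
      "aggs": {
        "index_count": {
          "cardinality": { "field": "_index" }
        }
      }
    }
  }
}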
You may find this wizard useful for considering different grouping approaches.
Hello Mark, thanks for your time. In simple terms, I just used your above code and got:
"reason" : {
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<reused_arrays>] would be [19233470920/17.9gb], which is larger than the limit of [16238008729/15.1gb], real usage: [12220742088/11.3gb], new bytes reserved: [7012728832/6.5gb], usages [request=7042945328/6.5gb, fielddata=30858/30.1kb, in_flight_requests=1736/1.6kb, accounting=331419640/316mb]",
"bytes_wanted" : 19233470920,
"bytes_limit" : 16238008729,
"durability" : "TRANSIENT"}
I think I'm overloading the computation, but I'm not sure, and I don't want to crash the server. Your suggestion would be helpful.
You can lower the cost of the cardinality agg using the precision_threshold setting, which I suggest you set to the number of indices you have so it is sized appropriately.
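For example, with the five monthly indices here, the cardinality sub-aggregation in the earlier sketches could be tightened like this (the field names are still the assumed ones):

"index_count": {
  "cardinality": {
    "field": "_index",
    "precision_threshold": 5
  }
}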