Count in other indices based on current index field

Scenario:
Indices: 2019-05*, 2019-06*, 2019-07*, 2019-08*, 2019-09*

I have an ID field in all of the indices, each of which contains millions of documents.

I want to analyse how many IDs that appear in 2019-05* are repeated in the other indices, like 2019-06*, 2019-07*, and so on.

Before asking, I did a Google search and checked the forums, and came up with ideas like aggregations, scroll, and the composite aggregation, but none of them directly solves the problem I'm facing. I'm migrating from SQL to Elasticsearch, so I may have failed to state the exact problem; that's why I gave the scenario.

Thanks for understanding, and thanks for your support.

{
  "size": 0,
  "aggs": {
    "IDs": {
      "terms": {
        "field": "ID"
      },
      "aggs": {
        "number_of_indices": {
          "cardinality": {
            "field": "_index"
          }
        }
      }
    }
  }
}

You can use the cardinality aggregation on the _index field to count the number of indices. I've used the terms aggregation in this example, but if you have large numbers of IDs then swap this for the composite aggregation and use the after parameter to page through results.
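For illustration, a composite version might look like the sketch below (the myindices* index pattern and the page size of 1000 are placeholder assumptions, not part of the original answer):

GET myindices*/_search
{
  "size": 0,
  "aggs": {
    "IDs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "ID": { "terms": { "field": "ID" } } }
        ]
      },
      "aggs": {
        "number_of_indices": {
          "cardinality": {
            "field": "_index"
          }
        }
      }
    }
  }
}

Each response includes an after_key; pass it back inside the composite block as "after": { "ID": "<last key>" } to fetch the next page.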

Thanks a lot, that helps. One follow-up suggestion: this gives the individual count of each ID across the indices, which is a bit expensive given my data size.

All I need is the count itself. For example: index 2019-05* has ID values [1, 2, 3, 4, 5, 6] and
index 2019-06* has ID values [4, 5, 6].
The overlap is 3 IDs ([4, 5, 6]) present in both indices.
I just need the result 3. Maybe it's the intersection count of the ID term across both indices, but I don't know how to express that in Elasticsearch. Thank you.

That's an expensive thing to do in any kind of distributed system 🙂

This will be a little complex, but here we go.
You can run this query to get the IDs that have appeared in the most indices:

GET myindices*/_search
{
  "size": 0,
  "aggs": {
    "IDsSortedByNumIndices": {
      "terms": {
        "field": "ID",
        "size": 5000,
        "order": {
          "number_of_indices": "desc"
        }
      },
      "aggs": {
        "number_of_indices": {
          "cardinality": {
            "field": "_index"
          }
        }
      }
    }
  }
}

The challenge here is that you can't set the size on the terms agg to a massive number because you'll run out of memory. Instead you could use the composite agg and repeatedly call it with the after clause, but it only sorts by the ID key (not by the cardinality agg we are using as a child). If the majority of IDs appear in only one index, this could be wasteful because your client app would have to throw away all the results with low index cardinality.

There is a way to "page" through my example terms aggregation while maintaining the sort by index-popularity: break the IDs into N arbitrary groups using the terms partitioning feature, as sketched below.
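A minimal sketch, assuming 20 partitions (the partition count is illustrative; run the same request once for each partition value from 0 to num_partitions - 1):

GET myindices*/_search
{
  "size": 0,
  "aggs": {
    "IDsSortedByNumIndices": {
      "terms": {
        "field": "ID",
        "size": 5000,
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "order": {
          "number_of_indices": "desc"
        }
      },
      "aggs": {
        "number_of_indices": {
          "cardinality": {
            "field": "_index"
          }
        }
      }
    }
  }
}

Each partition holds roughly 1/20th of the distinct IDs, so each request only has to build buckets for that subset, keeping memory use bounded.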

You may find this wizard useful for considering different grouping approaches.

Hello Mark, thanks for your time. In simple terms, I just ran your above query and got:

"reason" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [<reused_arrays>] would be [19233470920/17.9gb], which is larger than the limit of [16238008729/15.1gb], real usage: [12220742088/11.3gb], new bytes reserved: [7012728832/6.5gb], usages [request=7042945328/6.5gb, fielddata=30858/30.1kb, in_flight_requests=1736/1.6kb, accounting=331419640/316mb]",
          "bytes_wanted" : 19233470920,
          "bytes_limit" : 16238008729,
          "durability" : "TRANSIENT"}

I think I'm overloading the computation, but I'm not sure, and I don't want to crash the server. Your suggestions would be helpful.

You can lower the cost of the cardinality agg using the precision_threshold setting, which I suggest you set to the number of indices you have so the data structures are sized appropriately.
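As a sketch, with the five monthly indices from the scenario, the child aggregation would become:

"number_of_indices": {
  "cardinality": {
    "field": "_index",
    "precision_threshold": 5
  }
}

Counts below the threshold are expected to be close to exact, so a threshold equal to the index count keeps the result accurate while using far less memory per bucket than the default.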
