Create histogram for size of each unique value in an index

(Andrew McFague) #1

My cluster has around 3.5 billion documents, and each document has a list of "foreign IDs" that are relevant to the health of the system it's tracking. However, these foreign IDs are constantly changing, so I'd like to set up a sort of histogram of the index over time, so that other systems can detect a substantial increase or decrease in the number of documents matched by a given foreign ID.

For example,

DocA: {"foreignIDs": [1, 2, 3]}
DocB: {"foreignIDs": [2, 3]}
DocC: {"foreignIDs": [1,4]}

I'd like to regularly poll and get the number of documents per foreign ID, such as:

{1: 2, 2: 2, 3: 2, 4: 1}
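
To make the desired result concrete, here is a minimal client-side sketch in Python that derives the same counts from the three sample documents. The function name `count_docs_per_id` is hypothetical and only for illustration; this is not how Elasticsearch computes it, and it obviously would not scale to billions of documents.

```python
from collections import Counter

# The three sample documents from above, keyed by name.
docs = {
    "DocA": {"foreignIDs": [1, 2, 3]},
    "DocB": {"foreignIDs": [2, 3]},
    "DocC": {"foreignIDs": [1, 4]},
}

def count_docs_per_id(documents):
    """Return {foreign_id: number of documents that contain it}."""
    counts = Counter()
    for doc in documents.values():
        # set() so an ID repeated inside one document is counted once.
        counts.update(set(doc["foreignIDs"]))
    return dict(counts)

histogram = count_docs_per_id(docs)
# histogram equals {1: 2, 2: 2, 3: 2, 4: 1}
```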

Currently, I am able to use the terms aggregation to get a complete list, but this seems to be a very expensive operation and can overwhelm the cluster, causing updates to time out. It also buckets the data, which can allegedly result in inaccurate counts. The response is also a MASSIVE amount of data.
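
For reference, a rough sketch of the kind of request I mean, written as Python dictionaries. The aggregation name `per_id` and the helper names are made up for illustration, the `"size": 0` setting on the terms aggregation (meaning "return all buckets") is my understanding of the 2.x behavior, and the sample response is hand-written to match the example documents, not real cluster output.

```python
def build_terms_request(field, agg_name="per_id"):
    """Build a search body that asks only for per-term document counts."""
    return {
        "size": 0,  # top-level size 0: return no hits, only aggregations
        "aggs": {
            agg_name: {
                # Assumption: in 2.x, size 0 here means "all terms".
                "terms": {"field": field, "size": 0}
            }
        },
    }

def buckets_to_histogram(response, agg_name="per_id"):
    """Flatten the aggregation buckets into {term: doc_count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

request_body = build_terms_request("foreignIDs")

# Hand-written response in the shape a terms aggregation returns.
sample_response = {
    "aggregations": {
        "per_id": {
            "buckets": [
                {"key": 1, "doc_count": 2},
                {"key": 2, "doc_count": 2},
                {"key": 3, "doc_count": 2},
                {"key": 4, "doc_count": 1},
            ]
        }
    }
}

histogram_from_agg = buckets_to_histogram(sample_response)
```

With one bucket per distinct foreign ID, the response grows linearly with the number of unique IDs, which is where the size problem comes from.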

Is there any recommended way to get this information out of Elasticsearch without negatively impacting the cluster? Does Elasticsearch itself provide any means of statically maintaining this count? Or at least a way to optimize the lookups?

Thanks for all your help, and keep up the great work!


(JDK8, Elasticsearch 2.0.0)

(system) #2