We have a report of page interlinking in which each document contains a source link and a target link. We want to find how many pages have 1 link vs. how many have 2, 3, etc.
Normally I'd approach this with a terms aggregation on the source link (this link has 2 docs, this one has 5, etc.). But we have maybe a million links, and I don't think a terms aggregation would be a good idea. On top of that, we really just want the number of links that have a given count; we don't care about the actual links.
How could I do this in Elasticsearch?
You could do that with Graph pretty simply, but you may run into the same problems at that large scale. It'd certainly be hard to show the relationships with so many links.
Thanks for the reply. I don't think I need the actual terms, just a histogram of how many terms have each document count.
For example, say I have 1 million URLs that are each interlinked to 10 other URLs. I have documents that each hold a source_url and a target_url. I want to know how many target_urls have 1 inbound link (source_url), how many have 2, etc. I don't care what the URLs are or how they are linked, just the histogram of numbers. If the documents are perfectly interconnected, I might have 10M documents, with every URL having 10 connections. More realistically, there will be many with 1 link and many with 50 links. I need to know how many.
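To make the desired output concrete, here is a small Python sketch of the "histogram of counts" computed in memory. It is only an illustration of the result I'm after, not something that would run inside Elasticsearch; the function name `inlink_histogram` and the sample URLs are made up for the example.

```python
from collections import Counter

def inlink_histogram(pairs):
    """Map 'number of inbound links' -> 'number of target URLs with that many'."""
    # First pass: count how many documents (links) point at each target_url.
    links_per_target = Counter(target for _source, target in pairs)
    # Second pass: count how many targets share each link count.
    return Counter(links_per_target.values())

# Hypothetical sample documents as (source_url, target_url) pairs.
sample = [
    ("/a", "/x"),
    ("/b", "/x"),
    ("/c", "/x"),
    ("/a", "/y"),
    ("/b", "/z"),
]

histogram = inlink_histogram(sample)
print(sorted(histogram.items()))  # [(1, 2), (3, 1)]
```

Here two targets have 1 inbound link and one target has 3. The URLs themselves are discarded, which is exactly the shape of answer I want back.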
Perhaps this could be done with a pipeline aggregation. The issue is that it seems I usually also need to return the primary aggregation results, and that would be a lot of buckets (about 1M).
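One possible workaround, sketched below in Python, is to page through the per-URL buckets with a composite aggregation and tally the `doc_count` values on the client, so the full list of 1M buckets is never held at once. This is a sketch under assumptions, not a tested solution: it assumes `target_url` is mapped as a keyword field, the aggregation name `targets` and the helper names are made up, and `fetch_page` is a hypothetical callable that sends the body to the index's search endpoint and returns the parsed JSON response.

```python
from collections import Counter

def composite_body(after_key=None, page_size=1000, field="target_url"):
    """Build a search body that pages over one bucket per distinct field value."""
    composite = {
        "size": page_size,
        "sources": [{"target": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    # "size": 0 suppresses the hits; only the aggregation is returned.
    return {"size": 0, "aggs": {"targets": {"composite": composite}}}

def tally_pages(fetch_page):
    """Return a Counter mapping link count -> number of URLs with that count.

    fetch_page is an assumed callable: it takes a request body and returns
    the search response as a dict.
    """
    histogram = Counter()
    after_key = None
    while True:
        response = fetch_page(composite_body(after_key))
        agg = response["aggregations"]["targets"]
        for bucket in agg["buckets"]:
            # Keep only the count; the URL in bucket["key"] is discarded.
            histogram[bucket["doc_count"]] += 1
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            break
    return histogram
```

This still visits every bucket once, so it trades a single huge response for many small ones rather than pushing the second-level counting into Elasticsearch itself.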