Aggregation across multiple indexes/indices - significant terms

anthonyfps · February 9, 2022, 9:09pm

Hi, we have indexes that are split by date for manageability but contain the same mappings.

I'm currently trying to use the significant terms aggregation to identify new terms, by specifying the foreground as e.g. the last day and the background as the rest of history. When the aggregation is confined to one index it works as expected. With multiple indexes, however, the background frequencies do not include the same fields across all the indices; I imagine only the background frequency within the index that the term was found in is included. Is this expected behaviour?
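
For readers following along, the kind of request described above might look roughly like the sketch below. It only builds the search body as a Python dict and does not talk to a cluster. The field names (`tags`, `@timestamp`) and the one-day cutoff are assumptions for illustration, not details from the original post; `background_filter` is the significant_terms option that restricts the background set.

```python
import json

def build_significant_terms_request(field, date_field, foreground_gte):
    """Build a search body: foreground = docs on/after the cutoff,
    background = docs before the cutoff (via background_filter)."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"range": {date_field: {"gte": foreground_gte}}},
        "aggs": {
            "new_terms": {
                "significant_terms": {
                    "field": field,
                    "background_filter": {
                        "range": {date_field: {"lt": foreground_gte}}
                    },
                }
            }
        },
    }

# Hypothetical field names and cutoff, chosen only for the example.
body = build_significant_terms_request("tags", "@timestamp", "now-1d/d")
print(json.dumps(body, indent=2))
```

The body would be sent to a multi-index search such as a date-based index pattern, which is where the behaviour in question shows up.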

Tomo_M · February 12, 2022, 3:39pm

Could you explain more about this?
Do background frequencies contain other fields from other indices?

Why do you think so? I would have thought it reasonable for the background frequency to be calculated over all indices, regardless of whether the index contains the term.

Mark_Harwood · February 13, 2022, 12:37pm

Hi Anthony.
Yes, this is expected behaviour.
Background frequency checks on potentially millions of candidate terms are expensive, so the implementation works with the local stats found in a shard.
The goal of finding “what is significant today?” is just not feasible using only day-based indices.

Distributed data is the enemy of a lot of analytical functions I’m afraid.
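
To make the effect concrete, here is a toy model in Python. It is not Elasticsearch code, and it simplifies heavily: it assumes one shard per daily index, and that a shard reports a term's background count only if the term shows up in that shard's foreground. The index names and counts are invented for the example.

```python
from collections import Counter

# Hypothetical daily indices (one shard each): term -> docs containing it.
doc_counts = {
    "logs-2022-02-07": Counter({"error": 40, "timeout": 5}),
    "logs-2022-02-08": Counter({"error": 35, "timeout": 4}),
    "logs-2022-02-09": Counter({"error": 30, "timeout": 20, "oom": 3}),
}

# "Today" is the foreground, so only this index has foreground documents.
FOREGROUND_INDEX = "logs-2022-02-09"

def shard_local_background(term):
    """Sum background counts only from shards whose foreground has the term."""
    return sum(
        counts[term]
        for name, counts in doc_counts.items()
        if name == FOREGROUND_INDEX and counts[term] > 0
    )

def global_background(term):
    """Sum background counts over every index, as one might naively expect."""
    return sum(counts[term] for counts in doc_counts.values())

for term in ("error", "timeout", "oom"):
    print(term, shard_local_background(term), global_background(term))
```

In this model the older indices never contribute, because none of their documents fall in the foreground, so a term's history in those indices is invisible to the score.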

anthonyfps · February 17, 2022, 1:24pm

Thanks for replying! I was imprecise here: I meant that the multiple indexes do not include the same terms, rather than fields. It seems that statistics are only collated locally (apparently per shard) rather than across the whole group of indexes. The bg_count number does seem to include the total across the indices, however, up to a limit.

I found that the background count did not include occurrences of the term in other indices.

anthonyfps · February 17, 2022, 1:25pm

Thanks for getting back to me. Sounds reasonable. The background count for that term seems to be an extra search; it would have to be limited to the local shard, or in aggregate there may be very many terms to search for across all the shards. It's straightforward enough to compile the information another way.
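
One possible way to compile it client-side, sketched in Python under some assumptions: per-index term counts have already been fetched separately (for instance with ordinary terms aggregations), the background set includes the foreground documents, and the score is modelled on the JLH-style heuristic of absolute change times relative change. The function names and sample numbers are hypothetical, and this is not claimed to reproduce Elasticsearch's scoring exactly.

```python
from collections import Counter

def merge_counts(per_index_counts):
    """Add up term -> doc count mappings collected from several indices."""
    total = Counter()
    for counts in per_index_counts:
        total.update(counts)
    return total

def jlh_like_score(fg_count, fg_total, bg_count, bg_total):
    """(fg% - bg%) * (fg% / bg%); returns 0.0 when the term is not
    over-represented in the foreground or the inputs are degenerate."""
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    if fg_pct <= bg_pct:
        return 0.0
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)

# Hypothetical counts: the last day is the foreground, and the background
# is the whole history including that day.
foreground = Counter({"error": 30, "oom": 3})
background = merge_counts([
    Counter({"error": 40}),
    Counter({"error": 35}),
    foreground,
])
fg_total, bg_total = 100, 1000  # assumed document totals

scores = {
    term: jlh_like_score(foreground[term], fg_total, background[term], bg_total)
    for term in foreground
}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 4))
```

Because the background counts are merged across all indices before scoring, a term's occurrences in older indices are taken into account, at the cost of fetching the term counts yourself.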

