Hi, we have indexes that are split by date for manageability but contain the same mappings.
I'm currently trying to use the significant terms aggregation to identify new terms, by specifying the foreground as e.g. the last day and the background as the rest of history. When the aggregation is confined to one index it works as expected. However, with multiple indexes the background frequencies do not include the same fields across all the indices; I imagine the background frequency is only counted within the index that the term was found in. Is this expected behaviour?
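For illustration, the kind of request I mean could be sketched roughly as below. The field names ("message", "@timestamp") and the aggregation name "new_terms" are placeholders I made up for this example, not our real mappings. The range query defines the foreground, and the background is left at its default.

```python
def significant_terms_body(text_field, date_field, since="now-1d/d"):
    """Build a search body: foreground = docs newer than `since`."""
    return {
        "size": 0,  # only the aggregation result is wanted, not the hits
        # The query selects the foreground set (e.g. the last day).
        "query": {"range": {date_field: {"gte": since}}},
        "aggs": {
            # "new_terms" is an arbitrary, hypothetical aggregation name.
            "new_terms": {
                "significant_terms": {"field": text_field}
            }
        },
    }


# Hypothetical field names, for illustration only.
body = significant_terms_body("message", "@timestamp")
```

This body would be sent to the search endpoint of one index, or of a pattern covering several daily indices, which is the case where the background counts look wrong.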
Hi Anthony.
Yes, this is expected behaviour.
Background frequency checks on potentially millions of candidate terms are expensive, so the implementation works with the local stats found in a shard.
The goal of finding “what is significant today?” is just not feasible using only day-based indices.
Distributed data is the enemy of a lot of analytical functions I’m afraid.
Thanks for replying! I was imprecise here: I meant that the multiple indexes do not include the same terms, rather than fields. It seems that statistics are only collated locally (per shard, it seems) rather than across the whole group of indexes. The bg_count number does seem to include the total across the indices, however, up to a limit.
I found that the background count did not include occurrences of the term in other indices.
Thanks for getting back to me. Sounds reasonable. The background count for each term seems to be an extra search, so it would have to be limited to the local shard; otherwise there could be, in aggregate, a great many terms to search for across all the shards. It's straightforward enough to compile the information another way.
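As a rough sketch of what "another way" could look like: collect per-index term counts separately (for example from one plain terms aggregation per index), sum them client-side, and compare the foreground rate with the merged background rate. The function names and the smoothed ratio used as a score are my own assumptions for illustration; this is not the scoring that the significant terms aggregation itself applies.

```python
from collections import Counter


def merge_counts(per_index_counts):
    """Sum term -> doc-count mappings gathered from each index separately."""
    total = Counter()
    for counts in per_index_counts:
        total.update(counts)
    return total


def new_term_scores(fg_counts, fg_docs, bg_counts_per_index, bg_docs_per_index):
    """Score each foreground term by foreground rate / global background rate."""
    bg_counts = merge_counts(bg_counts_per_index)
    bg_docs = sum(bg_docs_per_index)
    scores = {}
    for term, fg in fg_counts.items():
        fg_rate = fg / fg_docs
        # Add-one smoothing so a term never seen in history
        # does not cause a division by zero.
        bg_rate = (bg_counts.get(term, 0) + 1) / (bg_docs + 1)
        scores[term] = fg_rate / bg_rate
    return scores


# Made-up numbers: one foreground day and two historical indices.
scores = new_term_scores(
    fg_counts={"outage": 5, "login": 10},
    fg_docs=100,
    bg_counts_per_index=[{"login": 90}, {"login": 110, "outage": 1}],
    bg_docs_per_index=[1000, 1000],
)
```

With these made-up numbers, "outage" scores far higher than "login", because it is rare across the merged history but common in the foreground day.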