Find Trending topics without background set

sj12345 · April 5, 2021, 5:59am

I have read the topic "Find trending article".

Referring to Mark Harwood's comment on that page: "maybe it’s failing to find anything statistically significant in the sample of 100 headlines that differs materially from other days."

I have a follow-up question. Is it possible to run a "significant text" type aggregation without any background set?

I want to find trending topics (i.e. unusual phrases) within a particular set of documents, without reference to any background set.

How can this be done?

Mark_Harwood · April 5, 2021, 1:33pm

I’ve used various datasets as background before now, e.g. a thousand or so English Wikipedia articles or pages from Project Gutenberg books.
When put in the same index/field name as your content, this provides background stats for everyday language and gives a baseline to diff against. You just need a “dataset_name” field or similar to query on, so that the foreground matches your content but none of the background.
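
As a rough sketch of the setup described above, the request body could look something like this. The text field name "content" and the dataset value "my_content" are assumptions made up for illustration; only "dataset_name" comes from the post. The query selects just your own documents as the foreground, while the rest of the index (including the Wikipedia or Gutenberg docs) serves as the default background.

```python
import json


def build_trending_request(dataset="my_content", text_field="content", size=10):
    """Build a search body whose foreground is a single dataset.

    `dataset` and `text_field` are hypothetical names; adjust to your mapping.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        # Foreground: only documents tagged as our own content.
        "query": {"term": {"dataset_name": dataset}},
        "aggs": {
            "trending": {
                "significant_text": {
                    "field": text_field,
                    "size": size,
                    # Helps with boilerplate/duplicated text in the sample.
                    "filter_duplicate_text": True,
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_trending_request(), indent=2))
```

The printed JSON can be sent to the `_search` endpoint of the index that holds both the content and the background documents.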

sj12345 · April 5, 2021, 4:17pm

Is there an independent metric to verify if this trending topic algorithm is working correctly ?

I wanted to check the returned trending topics against the IDF scores, but there is no API to access them. Elasticsearch does return term vectors, but there is no way to obtain IDF scores (as discussed in "Accessing tf-idf").

My problem is as follows : I have not seen other trending topic algorithms use a background set explicitly, so I am reluctant to just deploy this as-is.

I see that I can use the background filter to change the background set. That would provide some proof via experimentation that the trending topics are changing as I vary the background set. Would you recommend that ?
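
For what it's worth, the experiment I have in mind would look roughly like the sketch below: the same foreground query, run once per candidate background via `background_filter`. The field "content" and the dataset names are hypothetical. I include the foreground dataset in each background filter so the foreground stays a subset of the background; that is my own choice for this sketch, not something stated in the thread.

```python
def build_request_with_background(dataset, background_dataset, text_field="content"):
    """Search body with an explicit background set.

    All names here are illustrative assumptions, apart from the
    `background_filter` option of the significant_text aggregation.
    """
    return {
        "size": 0,
        "query": {"term": {"dataset_name": dataset}},
        "aggs": {
            "trending": {
                "significant_text": {
                    "field": text_field,
                    # Background = our content plus one reference corpus.
                    "background_filter": {
                        "terms": {"dataset_name": [dataset, background_dataset]}
                    },
                }
            }
        },
    }


# One request per candidate background, to compare the returned terms.
requests_by_background = {
    name: build_request_with_background("my_content", name)
    for name in ("wikipedia", "gutenberg")
}
```

If the top terms stay broadly stable across backgrounds, that would be some evidence the results are not an artifact of one particular reference corpus.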

Mark_Harwood · April 5, 2021, 4:54pm

You have to have some notion of normal language use to base your comparisons against, otherwise “the” is always hot.

sj12345 · April 5, 2021, 5:20pm

True, you don't want the word "the" to be hot. Maybe this algorithm makes sense in the ES world.

Is there a way to obtain IDF scores to verify ?

Otherwise, we can close this topic. Thanks!

Mark_Harwood · April 5, 2021, 5:34pm

There are 4 numbers used in these calculations which produce a score.

1. The number of docs with the trending term in the foreground set
2. The size of the foreground set
3. The number of docs with the trending term in the background set
4. The size of the background set

We give you numbers 1 to 3 in the results and 4 is simply the number of docs in your index.
Scores calculated from these 4 numbers depend on your choice of significance heuristic algo.
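
To make that concrete, here is a minimal sketch of how a score could be derived from the four numbers, using the JLH idea of multiplying the absolute change in frequency by the relative change. It is an illustration only, not a copy of the Elasticsearch implementation; the function name, the parameter names and the handling of a zero background count are assumptions of this sketch.

```python
def jlh_style_score(fg_term_docs, fg_size, bg_term_docs, bg_size):
    """Score a term from the four counts listed above (JLH-style sketch)."""
    if fg_size == 0 or bg_size == 0:
        return 0.0
    # Assumption of this sketch: treat an unseen background term as
    # seen once, to avoid dividing by zero.
    bg_term_docs = max(bg_term_docs, 1)
    fg_pct = fg_term_docs / fg_size  # how common the term is in the foreground
    bg_pct = bg_term_docs / bg_size  # how common it is in the background
    absolute_change = fg_pct - bg_pct
    if absolute_change <= 0:
        return 0.0  # not more common than usual, so not trending
    relative_change = fg_pct / bg_pct
    return absolute_change * relative_change


# A term in 20 of 100 foreground docs but only 50 of 10,000 overall.
unusual = jlh_style_score(20, 100, 50, 10_000)   # about 7.8
# "the" is in 95% of docs everywhere, so it scores zero.
common = jlh_style_score(95, 100, 9_500, 10_000)
print(unusual, common)
```

The second call shows why a background is needed: by raw foreground frequency alone “the” would rank first, but compared against normal usage it is not unusual at all.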

system · May 3, 2021, 5:34pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
