Find Trending topics without background set

I have read this article: Find trending article.

Referring to Mark Harwood's comment on that page: "maybe it’s failing to find anything statistically significant in the sample of 100 headlines that differs materially from other days."

I had a follow-up question. Is it possible to run a "significant text" type aggregation without any background set?

I want to find trending topics (i.e. unusual phrases) within a particular set of documents, without reference to any background set.

How can this be done?

I’ve used various datasets as background before now, e.g. a thousand or so English Wikipedia articles or pages from Project Gutenberg books.
When put in the same index and field name as your content, such a dataset provides background stats for everyday language and gives a baseline to diff against. You just need a “dataset_name” field or similar to query on, so that the foreground matches your content but none of the background.
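As a rough sketch of that setup, the search body below restricts the foreground to one dataset via the “dataset_name” field and leaves the background as the whole index (your content plus the everyday-language corpus). The text field name "content", the dataset value "my_headlines" and the helper name are assumptions for illustration only.

```python
import json


def build_significant_text_request(dataset_name, text_field="content", size=10):
    """Build a search body whose foreground is a single dataset.

    The query selects only documents tagged with `dataset_name`, so the
    background defaults to every document in the index, including the
    reference corpus indexed alongside it.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"term": {"dataset_name": dataset_name}},
        "aggs": {
            "trending": {
                "significant_text": {"field": text_field, "size": size}
            }
        },
    }


# "my_headlines" is a hypothetical value of the dataset_name field.
body = build_significant_text_request("my_headlines")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request against the shared index.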

Is there an independent metric to verify if this trending topic algorithm is working correctly ?

I wanted to check the returned trending topics against their IDF scores, but there is no API to access them. Elasticsearch does return term vectors, but there is no way to obtain IDF scores (as discussed here: Accessing tf-idf).

My problem is as follows : I have not seen other trending topic algorithms use a background set explicitly, so I am reluctant to just deploy this as-is.

I see that I can use the background filter to change the background set. That would provide some proof via experimentation that the trending topics are changing as I vary the background set. Would you recommend that ?
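For reference, a minimal sketch of that experiment: the same aggregation is built with different background filters, and the returned term lists are compared with a simple overlap measure. The field names "content" and "dataset_name", the dataset values and both helper names are assumptions, and Jaccard overlap is just one possible way to quantify how much the topics shift.

```python
def significant_text_with_background(text_field, background_dataset):
    """Aggregation clause that scores terms against one chosen background."""
    return {
        "significant_text": {
            "field": text_field,
            # Only docs matching this filter supply the background stats.
            "background_filter": {
                "term": {"dataset_name": background_dataset}
            },
        }
    }


def term_overlap(terms_a, terms_b):
    """Jaccard overlap between two lists of returned terms (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical background datasets indexed alongside the content.
clauses = {
    name: significant_text_with_background("content", name)
    for name in ("wikipedia", "gutenberg")
}

# Made-up term lists standing in for the buckets of two real responses.
overlap = term_overlap(["election", "storm", "merger"],
                       ["election", "storm", "strike"])
print(overlap)
```

A low overlap between runs would suggest the results are sensitive to the choice of background; a high overlap would suggest they are fairly stable.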

You have to have some notion of normal language use to base your comparisons against, otherwise “the” is always hot.

True, you don't want the word "the" to be hot. Maybe this algorithm makes sense in the ES world.

Is there a way to obtain IDF scores to verify ?

Otherwise, we can close this topic. Thanks!

There are 4 numbers used in these calculations which produce a score.

  1. The number of docs with the trending term in the foreground set
  2. The size of the foreground set
  3. The number of docs with the trending term in the background set
  4. The size of the background set

We give you numbers 1 to 3 in the results, and 4 is simply the number of docs in your index.
The score calculated from these 4 numbers depends on your choice of significance heuristic.
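To make the four numbers concrete, here is a sketch that recomputes a score from them outside Elasticsearch. It assumes a JLH-style heuristic (absolute change in term frequency multiplied by relative change); other heuristics combine the same inputs differently. The classic log(N / df) IDF is included only as an independent cross-check and is not necessarily the exact formula used in relevance scoring. Function names and sample counts are made up.

```python
import math


def jlh_score(fg_count, fg_size, bg_count, bg_size):
    """JLH-style significance score from the four counts listed above."""
    if fg_size == 0 or bg_size == 0:
        return 0.0
    bg_count = max(bg_count, 1)  # avoid dividing by zero for unseen terms
    fg_rate = fg_count / fg_size
    bg_rate = bg_count / bg_size
    absolute_change = fg_rate - bg_rate
    if absolute_change <= 0:
        return 0.0  # term is no more common in the foreground
    relative_change = fg_rate / bg_rate
    return absolute_change * relative_change


def classic_idf(bg_count, bg_size):
    """Textbook IDF, log(N / df), derived from numbers 3 and 4."""
    return math.log(bg_size / max(bg_count, 1))


# Hypothetical counts: a term in 10 of 100 foreground docs, 20 of 10,000 overall.
unusual = jlh_score(10, 100, 20, 10_000)

# A word like "the" appears in every doc on both sides, so nothing changes.
common = jlh_score(100, 100, 10_000, 10_000)

print(unusual, common, classic_idf(20, 10_000))
```

With these sample counts the unusual term scores well above zero, while the word present in every document scores exactly zero, which is why a background is needed to keep “the” from looking hot.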
