Referring to Mark Harwood's comment on that page: "maybe it’s failing to find anything statistically significant in the sample of 100 headlines that differs materially from other days."
I had a follow-up question. Is it possible to run a "significant text" type aggregation without any background set?
I want to find trending topics (i.e. unusual phrases) within a particular set of documents without reference to any background set.
I’ve used various datasets as background before now, e.g. a thousand or so English Wikipedia articles or pages from Project Gutenberg books.
When put in the same index/field name as your content, such a dataset provides background stats for everyday language and gives a baseline to diff against. You just need a “dataset_name” field or similar to query on, so that the foreground matches your content but none of the background.
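A minimal sketch of what such a request body could look like, built as a plain Python dict. The text field `headline` and the dataset value `my_news` are made-up names for illustration; `dataset_name` is the field suggested above. The query restricts the foreground to your own content, while the background defaults to the whole index, including the Wikipedia or Gutenberg documents. Wrapping the aggregation in a `sampler` is an assumption on my part, based on the usual advice to run `significant_text` on a sample of top matches.

```python
import json


def build_trending_request(text_field, dataset_field, content_name, sample_size=100):
    """Build a search body whose foreground is one dataset only.

    All argument values are hypothetical; adapt them to your own mapping.
    """
    return {
        # Foreground: only documents tagged as your own content.
        "query": {"term": {dataset_field: content_name}},
        # We only want the aggregation, not the hits themselves.
        "size": 0,
        "aggs": {
            "sample": {
                # Analyse a sample of the top matches per shard.
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "trending": {
                        # No background_filter given, so the background
                        # is every document in the index.
                        "significant_text": {"field": text_field}
                    }
                },
            }
        },
    }


body = build_trending_request("headline", "dataset_name", "my_news")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the `_search` endpoint of the index that holds both your content and the background documents.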
Is there an independent metric to verify that this trending topic algorithm is working correctly?
I wanted to check the returned trending topics against their IDF scores, but there is no API to access them. Elasticsearch does return term vectors, but there is no way to obtain IDF scores (as discussed in "Accessing tf-idf").
My problem is as follows: I have not seen other trending topic algorithms use a background set explicitly, so I am reluctant to just deploy this as-is.
I see that I can use the background filter to change the background set. That would provide some proof via experimentation that the trending topics change as I vary the background set. Would you recommend that?
There are 4 numbers used in these calculations which produce a score:
1. The number of docs with the trending term in the foreground set
2. The size of the foreground set
3. The number of docs with the trending term in the background set
4. The size of the background set
We give you numbers 1 to 3 in the results and 4 is simply the number of docs in your index.
Scores calculated from these 4 numbers depend on your choice of significance heuristic algo.
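To make that concrete, here is a rough sketch of how one heuristic, JLH, turns those 4 numbers into a score: the absolute change in the term's frequency between background and foreground, multiplied by the relative change. The function and argument names are my own, and returning 0 for empty sets or for terms missing from the background is a simplification for this sketch, not necessarily how Elasticsearch handles those cases.

```python
def jlh_score(fg_count, fg_size, bg_count, bg_size):
    """JLH-style significance score from the 4 numbers.

    fg_count: docs containing the term in the foreground set
    fg_size:  size of the foreground set
    bg_count: docs containing the term in the background set
    bg_size:  size of the background set
    """
    # Simplification for this sketch: avoid division by zero.
    if fg_size == 0 or bg_size == 0 or bg_count == 0:
        return 0.0

    fg_rate = fg_count / fg_size
    bg_rate = bg_count / bg_size

    # Terms that are no more common in the foreground are not "trending".
    absolute_change = fg_rate - bg_rate
    if absolute_change <= 0:
        return 0.0

    relative_change = fg_rate / bg_rate
    return absolute_change * relative_change


# Term appears in 20 of 100 sampled headlines, but only 50 of 10,000 docs overall.
score = jlh_score(20, 100, 50, 10_000)
print(score)

# Same foreground numbers, but the term is just as common in the background.
print(jlh_score(20, 100, 2_000, 10_000))
```

The first call gives a score of roughly 7.8 (a 19.5 point absolute change times a 40x relative change), while the second gives 0 because the term is no more frequent in the foreground than in the background. Recomputing scores this way from the returned numbers is one way to sanity-check the results by hand.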