I followed the blog and ran the classification on the 20 Newsgroups dataset, but I had a surprising experience. When I tested the classification on the indexed documents as the test set, the accuracy was surprisingly low. I would like to know why this is the case. Here is some additional information.
Classification report:
Class | Precision | Recall | F1-Score | Support
In some further experimental runs, I used different values for the min_doc_freq hyper-parameter of the MLT query and noticed that the accuracy improved. On a much smaller dataset that I cannot make public, accuracy on the training set improved from 60% to 92% when min_doc_freq was lowered from its default value of 5 to 1. The test set accuracy was 72%. I think this is already commendable given that it comes for free and is very fast to set up. Great work guys!
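For anyone wanting to reproduce this, here is a minimal sketch of the kind of MLT-based classifier being discussed, with min_doc_freq exposed as a knob. The index name, field names, and client setup are my own assumptions, not the blog's exact code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def classify(text, index="newsgroups", text_field="text",
             label_field="category", min_doc_freq=1):
    """Predict a label by finding the most similar indexed document
    with a more_like_this query and reusing its category."""
    query = {
        "more_like_this": {
            "fields": [text_field],
            "like": text,
            "min_term_freq": 1,
            # Raising this floor drops rarer terms from the generated query;
            # lowering it to 1 keeps them, which is what helped on the smaller corpus.
            "min_doc_freq": min_doc_freq,
            "max_query_terms": 25,
        }
    }
    hits = es.search(index=index, query=query, size=1)["hits"]["hits"]
    return hits[0]["_source"][label_field] if hits else None

print(classify("The engine misfires when the spark plugs are cold."))
```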
I’m not sure about your corpus, but there’s a tricky balance in eliminating terms/phrases where doc freq overstates how common they are. In other words, given the Zipfian distribution of terms in a corpus, many, many (most?) terms occur just once. It’s difficult to know whether a single term occurrence is “natural” or whether it’s spurious and the term is actually rarer in the language.
So a doc freq floor is often needed so we can say with confidence that something really is a “natural” rare term. But it’s a tricky balance: most terms are rare, so get too aggressive and you’ll eliminate much of the vocabulary from your corpus.
It turns into a very corpus-specific exercise.
Ted Dunning has a classic paper, “Accurate Methods for the Statistics of Surprise and Coincidence”, that I highly recommend for understanding term statistics. He has a method for gaining confidence in a term co-occurrence that I find appealing.
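For reference, the method from that paper is the log-likelihood ratio (G²) test over a 2x2 contingency table of counts. A minimal sketch of the calculation (the counts in the example call are made up purely for illustration):

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.
    k11: both events together, k12/k21: one without the other, k22: neither."""
    def entropy(*counts):
        # Denormalized entropy term: sum k * ln(k / total), skipping zeros.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2 * (entropy(k11, k12, k21, k22)
                - entropy(k11 + k12, k21 + k22)
                - entropy(k11 + k21, k12 + k22))

# Example: a term pair co-occurring 110 times; the higher the score,
# the more confident we can be the co-occurrence is not just chance.
print(llr_2x2(110, 2442, 111, 29114))
```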