Text classification with ES


I followed the blog and ran the classification on the 20 Newsgroups dataset, but I had a surprising experience. When I tested the classification using the indexed documents themselves as the test set, the accuracy was surprisingly low. I would like to know why this is the case. Here is some additional information.

In some further experimental runs, I used different values for the min_doc_freq hyper-parameter of the MLT query and noticed that the accuracy improved. On a much smaller dataset, which I cannot make public, the accuracy on the training dataset improved from 60% to 92% when min_doc_freq was lowered from its default value of 5 to 1. The test set accuracy was 72%. I think this is already commendable, given that it comes for free and is very fast to set up. Great work guys!
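For anyone who wants to try the same change, here is a minimal sketch of a more_like_this search body with min_doc_freq lowered to 1. The field name "body" and the helper name build_mlt_query are hypothetical, not taken from the blog; the sketch only builds the request body and does not talk to a cluster.

```python
import json


def build_mlt_query(text, fields=("body",), min_doc_freq=1, min_term_freq=1, size=10):
    """Build a more_like_this search body.

    The field name "body" is an assumption; use whatever text field
    your index actually has.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": text,
                # Terms must occur at least this often in the input text.
                "min_term_freq": min_term_freq,
                # Terms found in fewer indexed documents than this are ignored.
                "min_doc_freq": min_doc_freq,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_mlt_query("text of the document to classify"), indent=2))
```

The resulting dict can be sent as the body of a regular search request; comparing runs with min_doc_freq set to 5 and to 1 should reproduce the kind of difference described above.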

Still, any insights or thoughts?

I’m not sure about your corpus, but there’s a tricky balance in eliminating terms/phrases where doc freq overstates how common they are. In other words, given the Zipfian distribution of terms in a corpus, many, many (most?) terms occur just once. It’s difficult to know whether a single term occurrence is “natural” or whether it’s spurious and the term is actually rarer in the language.

So a doc freq floor is often needed before we can say with confidence that this is a “natural” rare term. But it’s a tricky balance. Most terms are rare, so get too aggressive and you’ll eliminate much of the vocabulary from your corpus.

It turns into a very corpus-specific exercise.
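To get a feel for this on your own data, a rough sketch like the following shows how much of the vocabulary survives a given doc freq floor. The toy corpus and the whitespace tokenizer are assumptions for illustration only; a real analyzer will produce different counts.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats.
        df.update(set(doc.lower().split()))
    return df


def surviving_fraction(df, min_doc_freq):
    """Fraction of the vocabulary whose doc freq meets the floor."""
    if not df:
        return 0.0
    kept = sum(1 for count in df.values() if count >= min_doc_freq)
    return kept / len(df)


# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
]

df = document_frequencies(docs)
for floor in (1, 2, 3):
    print(floor, surviving_fraction(df, floor))
```

Even on three short documents, raising the floor from 1 to 2 drops more than half of the distinct terms, which is the effect that gets painful on a small corpus.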

Ted Dunning has a classic paper, “On Surprise and Coincidence”, that I highly recommend for understanding term statistics. He has a method for gaining confidence in a term co-occurrence that I find appealing.
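As a rough illustration, the statistic usually associated with that work is the log-likelihood ratio (G²) computed over a 2x2 table of counts. The sketch below is my reading of it, not code from the paper, and the function name and argument layout are my own. For a term and a category, k11 is the number of documents in the category that contain the term, k12 the documents in the category without it, k21 the documents outside the category with it, and k22 everything else.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts.

    Values near 0 mean the counts look like independence; larger values
    mean the association is less likely to be a coincidence.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))

    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # a zero cell contributes nothing (k * ln k -> 0)
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


if __name__ == "__main__":
    # Term seen once, inside the category: weak evidence.
    print(log_likelihood_ratio(1, 99, 0, 900))
    # Term seen 20 times, always inside the category: much stronger evidence.
    print(log_likelihood_ratio(20, 80, 0, 900))
```

The appeal is that a term seen once gets a low score however "perfectly" it lines up with a category, so the statistic can play the role of a doc freq floor without a hard cutoff.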

This blog article of mine might also be useful



