I followed the blog and ran the classification on the 20 Newsgroups dataset, but I had a surprising experience. When I tested the classification on the indexed documents as the test set, the accuracy was surprisingly low. I would like to know why this is the case. Here is some additional information.
Classification report:
Class | Precision | Recall | F1-Score | Support
In some further experimental runs, I used different values for the min_doc_freq hyper-parameter of the MLT query and noticed that the accuracy improved. On a much smaller dataset that I cannot make public, accuracy on the training set improved from 60% to 92% when min_doc_freq was lowered from its default value of 5 to 1. The test set accuracy was 72%. I think this is already commendable given that it comes for free and is very fast to set up. Great work guys!
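For anyone wanting to reproduce this, here is a minimal sketch of the kind of MLT-based classifier being discussed, with min_doc_freq exposed as a knob. The index name, field names, and client setup are my own assumptions, not the blog's exact code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def classify(text, index="newsgroups", text_field="text",
             label_field="category", min_doc_freq=1):
    """Predict a label by finding the most similar indexed document
    with a more_like_this query and reusing its category."""
    query = {
        "more_like_this": {
            "fields": [text_field],
            "like": text,
            "min_term_freq": 1,
            # Raising this floor drops rarer terms from the generated query;
            # lowering it to 1 keeps them, which is what helped on the smaller corpus.
            "min_doc_freq": min_doc_freq,
            "max_query_terms": 25,
        }
    }
    hits = es.search(index=index, query=query, size=1)["hits"]["hits"]
    return hits[0]["_source"][label_field] if hits else None

print(classify("The engine misfires when the spark plugs are cold."))
```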
I’m not sure about your corpus, but there’s a tricky balance in eliminating terms/phrases where doc freq overstates how common they are. In other words, given the Zipfian distribution of terms in a corpus, many, many (most?) terms occur just once. It’s difficult to know whether a single term occurrence is “natural” or whether it’s spurious and the term is actually rarer in the language.
So a doc freq floor is often needed so we can say with confidence that something really is a “natural” rare term. But it’s a tricky balance: most terms are rare, so get too aggressive and you’ll eliminate much of the vocabulary from your corpus.
It turns into a very corpus-specific exercise.
Ted Dunning has a classic paper, “Accurate Methods for the Statistics of Surprise and Coincidence”, that I highly recommend for understanding term statistics. He has a method for gaining confidence in a term co-occurrence that I find appealing.
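For reference, the method from that paper is the log-likelihood ratio (G²) test over a 2x2 contingency table of counts. A minimal sketch of the calculation (the counts in the example call are made up purely for illustration):

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.
    k11: both events together, k12/k21: one without the other, k22: neither."""
    def entropy(*counts):
        # Denormalized entropy term: sum k * ln(k / total), skipping zeros.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2 * (entropy(k11, k12, k21, k22)
                - entropy(k11 + k12, k21 + k22)
                - entropy(k11 + k21, k12 + k22))

# Example: a term pair co-occurring 110 times; the higher the score,
# the more confident we can be the co-occurrence is not just chance.
print(llr_2x2(110, 2442, 111, 29114))
```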