Categorisation of documents

(Vishva Deepak Tewari) #1


I have a set of predefined terms or keywords, e.g. java, php, dot net, etc. As can be expected, these terms can have synonyms (java is the same as jdk or jre) and parent-child relationships (struts would have java as its parent).
Now let's say I have a document that I got from user input, and I need to suggest the best-fit terms for this document. How should I approach this?

For example, if the predefined terms are as follows:
Information Technology

And let's say the document we have as input is the following:

We urgently require software developer for our client. Candidates with knowledge of wireless technologies can apply for this job.

We should be able to identify that the user's important keywords are java developer and php developer.

I have tried approaching this problem by treating all the terms as documents, indexing them in Elasticsearch, and then using the input text as the query. The problem with this approach is that when "wireless technology" is not among the predefined terms, it still returns "Information Technology" as the best fit.
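For reference, the approach described above could be sketched roughly as follows. This only builds the search request body; the field name `name` and the helper `build_term_lookup_query` are hypothetical and not taken from my actual setup.

```python
import json


def build_term_lookup_query(document_text, field="name", size=5):
    """Build a search body that uses the whole input document as a match query.

    Assumption: every predefined term is indexed as its own document, with the
    term text stored in `field` (a hypothetical field name).
    """
    return {
        "size": size,
        "query": {"match": {field: document_text}},
    }


if __name__ == "__main__":
    text = (
        "We urgently require software developer for our client. "
        "Candidates with knowledge of wireless technologies can apply for this job."
    )
    # The printed JSON would be sent as the body of a search request
    # against the index that holds the predefined terms.
    print(json.dumps(build_term_lookup_query(text), indent=2))
```

Because a match query ranks any term that shares a token with the input, the top hit is only the best of whatever is indexed, which is why a loosely related term can come back first.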

I have tried solving this problem by using a synonym for information technology (infotech). But is there a better way to handle this situation without editing the synonym file every time?
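To make the synonym workaround concrete, here is a minimal sketch of index settings with a synonym token filter. It uses inline rules instead of a synonym file for brevity; the names `term_synonyms` and `term_analyzer` and the helper function are hypothetical, and the rules shown are only the examples from this post.

```python
import json


def build_synonym_settings(synonym_rules):
    """Build index settings with a custom analyzer that applies synonym rules.

    `synonym_rules` is a list of comma-separated equivalence rules, e.g.
    "java, jdk, jre". Filter and analyzer names are illustrative only.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "term_synonyms": {
                        "type": "synonym",
                        "synonyms": list(synonym_rules),
                    }
                },
                "analyzer": {
                    "term_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "term_synonyms"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    rules = ["java, jdk, jre", "infotech, information technology"]
    print(json.dumps(build_synonym_settings(rules), indent=2))
```

Every new equivalence means another rule in that list (or in the synonym file), which is the maintenance burden I would like to avoid.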

Vishvadeepak Tewari

(Mark Harwood) #2

If you have existing content to hand for training, then new text content can be classified by either:

  • Using the new content as a query on old content to see how similar docs were categorized [1]
  • Using the "percolate" API [2] to run a bunch of pre-trained queries that each define a category [3] over the new content
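As a rough sketch of both options, the Python below builds the request bodies and tallies categories from the similar-document hits. The field names `body`, `category` and `query` and all three helper functions are hypothetical. The percolate body assumes the query-DSL form of percolation; the older standalone `_percolate` endpoint takes a differently shaped body.

```python
from collections import defaultdict


def build_mlt_query(new_text, fields=("body",), size=10):
    """Option 1: find already-categorized docs that resemble the new text."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": new_text,
                # Low thresholds so that short inputs still yield query terms.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


def vote_on_categories(hits, category_field="category"):
    """Sum the relevance scores of the similar docs per category.

    `hits` is the list found under response["hits"]["hits"]; each hit is
    assumed to carry its category in `_source[category_field]`.
    Returns (category, total_score) pairs, best first.
    """
    totals = defaultdict(float)
    for hit in hits:
        category = hit.get("_source", {}).get(category_field)
        if category is not None:
            totals[category] += hit.get("_score") or 0.0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def build_percolate_query(new_text, query_field="query", doc_field="body"):
    """Option 2: match the new text against stored per-category queries."""
    return {
        "query": {
            "percolate": {
                "field": query_field,
                "document": {doc_field: new_text},
            }
        }
    }
```

With the first option a category is only suggested when similar, already-categorized documents exist, and a cut-off on the summed score can be used to return no suggestion at all for weak matches.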


[1] See the section on "MLT" query.
[3] Classifier query training using elasticsearch and a mixer (don't know what I was thinking....) :
