Categorisation of documents

vishva_deepak_tewari · May 12, 2017, 11:36am

Hi,

I have a problem where I have a set of terms or keywords, e.g. java, php, dot net, etc. As can be expected, these terms can have synonyms (java is the same as jdk or jre) or parents (struts would have java as its parent).
Now let's say I have a document that comes from user input, and I need to suggest the best-fit terms for it. How should I approach this?

E.g. if the pre-defined terms are as follows:
Information Technology

And let's say the input document is the following:

We urgently require software developer for our client. Candidates with knowledge of wireless technologies can apply for this job.

We should be able to identify that the user's important keywords are java developer and php developer.

I have tried approaching this problem by treating all the terms as documents and indexing them in Elasticsearch, then using the input text as the query. The problem with this approach is that if wireless technology is not among the specified terms, it returns Information Technology as the best fit.
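For readers following along, the approach described above might look roughly like the sketch below. It only builds the JSON bodies as Python dicts and does not talk to a cluster. The field names `term`, `synonyms` and `parent`, the boost on `term`, and the helper names are all assumptions made for illustration; the original post does not show its mapping or query.

```python
import json


def term_document(term, synonyms=(), parent=None):
    """Build one pre-defined term as its own document.

    The field names are hypothetical, chosen only for this sketch.
    """
    return {"term": term, "synonyms": list(synonyms), "parent": parent}


def best_fit_query(input_text, size=5):
    """Build a search body that uses the whole input document as query text.

    multi_match searches both the term itself and its synonyms;
    "term^2" boosts matches on the term field (an arbitrary choice here).
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": input_text,
                "fields": ["term^2", "synonyms"],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(term_document("java", ["jdk", "jre"]), indent=2))
    print(json.dumps(best_fit_query("We urgently require software developer"), indent=2))
```

With this shape, every word of the input contributes to the score, so a generic word such as "technologies" can pull in a broad term like Information Technology, which matches the behaviour described above.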

I have tried solving this by adding a synonym for information technology (infotech). But is there a better way to handle this situation without editing the synonym file every time?

Thanks
Vishvadeepak Tewari

Mark_Harwood · May 12, 2017, 1:27pm

If you have existing content to hand for training, then new text content can be classified by either:

Using the new content as a query on old content to see how similar docs were categorized [1]
Using the "percolate" API [2] to run a bunch of pre-trained queries, each defining a category [3], over the new content
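As a rough illustration of the two options, here is a minimal sketch that builds the request bodies as Python dicts and tallies the results. The field names `body`, `category` and `query`, the parameter values, and all helper names are assumptions for this example, not taken from the linked posts. The score-weighted vote is just one simple way to read "how similar docs were categorized". Depending on the Elasticsearch version, the percolate query may need extra parameters, so check the reference in [2].

```python
from collections import defaultdict


def mlt_query(new_text, text_field="body", size=10):
    """Option 1: use the new content as a more_like_this query on old content."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [text_field],
                "like": new_text,
                "min_term_freq": 1,  # loosened so short texts still match
                "min_doc_freq": 1,
            }
        },
    }


def vote_on_categories(hits, category_field="category"):
    """Sum the _score of each similar doc per category, best category first."""
    scores = defaultdict(float)
    for hit in hits:
        category = hit["_source"].get(category_field)
        if category is not None:
            scores[category] += hit["_score"]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def category_query_doc(category, keywords, text_field="body"):
    """Option 2, indexing side: a stored query that defines one category.

    Assumes the index maps the "query" field with type "percolator".
    """
    return {"category": category, "query": {"match": {text_field: " ".join(keywords)}}}


def percolate_query(new_text, text_field="body", query_field="query"):
    """Option 2, search side: find the stored queries that match the new content."""
    return {
        "query": {
            "percolate": {"field": query_field, "document": {text_field: new_text}}
        }
    }
```

The `hits` argument of `vote_on_categories` is expected to be the `hits.hits` list from a search response, where each entry carries `_score` and `_source`.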

Cheers
Mark

[1] https://www.elastic.co/blog/text-classification-made-easy-with-elasticsearch
See the section on "MLT" query.
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
[3] Classifier query training using elasticsearch and a mixer (don't know what I was thinking....) : https://vimeo.com/98729151

system · June 9, 2017, 1:36pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
