Keyword extraction


#1

I have a corpus of ~10K articles. For each article I would like to extract keywords (tags). So for every article I would like a ranking of the tokenized terms in the article based on their frequency in the article relative to their frequency in other articles in the corpus - along the lines of TF-IDF across the complete corpus.

I am hoping to find a clear A-to-Z guide. I've searched on Google, Google Groups, Stack Overflow, etc.

I'm very new to ES (a week or two), but really like the platform.

Thanks so much!


(Mark Harwood) #2

Given that you have a comparatively small set of docs and want to use existing Elasticsearch features, you could run a MoreLikeThisQuery on the text of each document, with a highlighter to mark up the results. The query criteria should also filter the results to the ID of the original document from which you took the example text.
You can then parse the marked-up response to retrieve the highlighted terms; these are the search terms that MoreLikeThis selected on the basis of TF/IDF.
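
A minimal sketch of that approach in Python, assuming a recent Elasticsearch version where the `more_like_this` query takes a `like` parameter and `bool` accepts a `filter` clause. The field name `body` and both helper function names are hypothetical; the regex relies on the highlighter's default `<em>` tags. It only builds the request body and parses highlight fragments, so sending the request is left to whichever client you use.

```python
import re


def build_mlt_keyword_query(doc_id, text, field="body", max_terms=25):
    """Build a search body: More Like This on `text`, restricted to `doc_id`,
    with highlighting on `field` (hypothetical field name)."""
    return {
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": [field],
                        "like": text,
                        "min_term_freq": 1,
                        "max_query_terms": max_terms,
                    }
                },
                # Only match the document the example text came from.
                "filter": {"ids": {"values": [doc_id]}},
            }
        },
        # number_of_fragments=0 asks for the whole field, not snippets.
        "highlight": {"fields": {field: {"number_of_fragments": 0}}},
    }


def highlighted_terms(fragments):
    """Collect distinct terms wrapped in the default <em>...</em> tags,
    lowercased, in order of first appearance."""
    terms = []
    for fragment in fragments:
        for term in re.findall(r"<em>(.*?)</em>", fragment):
            term = term.lower()
            if term not in terms:
                terms.append(term)
    return terms
```

The fragments passed to `highlighted_terms` would be the list found under `highlight` for the chosen field in each hit of the response.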

Cheers
Mark


(Valentin Pletzer) #3

Cool idea, but highlighting does not seem to work for me. Do I need to explicitly set "store" to true? Or am I missing something else?

Additional comment: when using a simple "match" query instead of MLT, highlighting works just fine on the same data.


(Mark Harwood) #4

You'd need to share a gist of an example doc, mappings and query to figure that out.
Sometimes with the highlighter you need to change this setting [1] if your search is on one field (e.g. _all) and you want to highlight another, e.g. title.

[1] http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#field-match
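
For illustration, a sketch of a request body with that setting changed. It assumes the setting behind link [1] is the highlighter's `require_field_match` option; the helper name and the example query are hypothetical.

```python
def highlight_other_field(query, highlight_field="title"):
    """Wrap `query` in a search body that highlights `highlight_field`
    even when the query itself targets a different field."""
    return {
        "query": query,
        "highlight": {
            # Assumption: this is the option the linked docs section covers.
            "require_field_match": False,
            "fields": {highlight_field: {}},
        },
    }


# Example: search on _all, highlight matches in title.
body = highlight_other_field({"match": {"_all": "keyword extraction"}})
```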


#5

Thanks, Mark! Will try this weekend.


(Alex Ksikes) #6

This is a perfect use case for a new feature in the Term Vectors API called terms filtering. You can also use the validate API with rewrite:true against a More Like This query. You don't need to have term vectors stored for this to work, as they are generated on the fly.
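
A rough sketch of the term vectors route, assuming the request body is sent to the `_termvectors` endpoint for a single document and that each filtered term in the response carries a `score`. The field name `body`, the helper names and the sample response are made up for illustration.

```python
def build_termvectors_body(field="body", max_terms=10):
    """Request body asking for the top `max_terms` terms of `field`,
    with term statistics so the filter can score them."""
    return {
        "fields": [field],
        "term_statistics": True,
        "field_statistics": True,
        "positions": False,
        "offsets": False,
        "filter": {
            "max_num_terms": max_terms,
            "min_term_freq": 1,
            "min_doc_freq": 1,
        },
    }


def top_keywords(response, field="body"):
    """Return (term, score) pairs from a term vectors response,
    highest score first. Terms without a score count as 0.0."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    ranked = sorted(
        terms.items(),
        key=lambda item: item[1].get("score", 0.0),
        reverse=True,
    )
    return [(term, info.get("score", 0.0)) for term, info in ranked]


# Made-up response fragment, only to show the expected shape.
sample = {
    "term_vectors": {
        "body": {
            "terms": {
                "elasticsearch": {"term_freq": 3, "score": 2.5},
                "the": {"term_freq": 9, "score": 0.1},
                "highlighter": {"term_freq": 1, "score": 1.2},
            }
        }
    }
}
keywords = top_keywords(sample)
```

Run once per article, this gives a ranked tag list without parsing highlighter markup.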

