Keyword extraction

emoe · May 5, 2015, 2:31pm

I have a corpus of ~10K articles. For each article I would like to extract keywords (tags). So for every article I would like a ranking of the tokenized terms in the article based on their frequency in the article relative to their frequency in other articles in the corpus - along the lines of TF-IDF across the complete corpus.
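As a rough illustration of the ranking described above, here is a minimal pure-Python sketch of plain TF-IDF over a tokenized corpus. It assumes tokenization has already happened, the function name `tfidf_keywords` is hypothetical, and the formula is the textbook one rather than Lucene's or Elasticsearch's actual scoring.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=5):
    """Rank each document's terms by textbook TF-IDF across the corpus.

    docs: list of token lists (tokenization is assumed to be done already).
    Returns one list of (term, score) pairs per document, best first.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    ranked = []
    for tokens in docs:
        term_freq = Counter(tokens)
        scores = {
            # Relative frequency in this article times log inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by score descending, then alphabetically to keep ties stable.
        ranked.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n])
    return ranked
```

A term that occurs in every article gets a score of zero, which is the behaviour wanted for tags: common words drop to the bottom of each article's ranking.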

I am hoping to find a clear A to Z guide. I've searched on Google, Google Groups, Stack Overflow, etc.

I'm very new to ES (a week or two), but really like the platform.

Thanks so much!

Mark_Harwood · May 5, 2015, 6:15pm

Given you have a comparatively small set of docs and want to use existing elasticsearch features, you could look at using the MoreLikeThisQuery on the text of each document and run a query with a highlighter to mark up the results. The query criteria should also filter the results to the ID of the original document from which you took the example text.
You can then parse the text of the marked-up response to retrieve those terms that were highlighted, as these will be the search terms selected by MoreLikeThis on the basis of TF/IDF.
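A minimal sketch of that approach in Python, not a tested recipe: one helper builds the request body and another pulls the highlighted terms out of the response. The helper names and the `body` field name are hypothetical. The body uses the `like` parameter and a `bool` filter as found in more recent Elasticsearch releases (older releases used different parameter names), and the parser assumes the highlighter's default `<em>` tags.

```python
import re

# The highlighter wraps matched terms in <em>...</em> by default.
EM_TAG = re.compile(r"<em>(.*?)</em>")


def build_mlt_highlight_query(doc_id, text, field="body", max_terms=25):
    """Build a search body: More Like This on `text`, restricted to one document."""
    return {
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": [field],
                        "like": text,
                        "min_term_freq": 1,
                        "max_query_terms": max_terms,
                    }
                },
                # Only return the document the example text came from.
                "filter": {"ids": {"values": [doc_id]}},
            }
        },
        # number_of_fragments=0 asks for the whole field, not snippets.
        "highlight": {"fields": {field: {"number_of_fragments": 0}}},
    }


def highlighted_terms(response, field="body"):
    """Collect distinct highlighted terms (lowercased) in order of appearance."""
    seen = []
    for hit in response.get("hits", {}).get("hits", []):
        for fragment in hit.get("highlight", {}).get(field, []):
            for term in EM_TAG.findall(fragment):
                term = term.lower()
                if term not in seen:
                    seen.append(term)
    return seen
```

The body would be sent to the index's `_search` endpoint with whatever client you use. The highlighted terms come back in document order, not in score order.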

Cheers
Mark

Valentin_Pletzer · May 6, 2015, 8:19pm

Cool idea, but highlighting does not seem to work for me. Do I need to explicitly set "store" to true? Or am I missing something else?

Additional comment: when using a simple "match" query instead of MLT, highlighting works just fine on the same data.

Mark_Harwood · May 6, 2015, 9:01pm

You'd need to share a gist of example doc, mappings and query to figure that out.
Sometimes with the highlighter you need to set this setting [1] if your search is on one field (e.g. _all) and you want to highlight another, e.g. title.

[1] http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#field-match

emoe · May 7, 2015, 9:57pm

Thanks, Mark! Will try this weekend.

Alex_Ksikes · May 11, 2015, 10:12am

This is a perfect use case of a new feature in the Term Vectors API called terms filtering. You can also use the validate API with rewrite:true against a More Like This query. You don't need to have term vectors stored for this to work, as they would be generated on the fly.
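A hedged sketch of the terms-filtering route in Python: one helper builds a body for the `_termvectors` endpoint using the documented `filter` options, and another ranks the terms by the `score` that the filtered response attaches to each one. The helper names and the `body` field name are hypothetical, and the exact endpoint path depends on your Elasticsearch version.

```python
def build_termvectors_body(field="body", max_terms=10):
    """Request body for the Term Vectors API with terms filtering enabled."""
    return {
        "fields": [field],
        # Corpus-wide statistics are what make the score TF-IDF-like.
        "term_statistics": True,
        "field_statistics": True,
        "positions": False,
        "offsets": False,
        "filter": {
            "max_num_terms": max_terms,  # keep only the top-scoring terms
            "min_term_freq": 1,
            "min_doc_freq": 1,
        },
    }


def top_terms(response, field="body"):
    """Return (term, score) pairs from a filtered term vectors response, best first."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    scored = [(term, info["score"]) for term, info in terms.items() if "score" in info]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))
```

Unlike the highlighting approach, this returns the terms together with their scores, so no markup parsing is needed.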
