Best Practice around keyphrase matching

Hey ES Community,

I'm looking to perform keyphrase analysis on some text and then be able to aggregate those keyphrases to determine what the most common topics are.

I already have a solution for keyphrase analysis; however, I'm not 100% sure what the best way to store the results would be.

The issue is that the keyphrases sometimes have extra terms with them.
For example, I may have "the exhibition", "the marketing exhibition" and "exhibition".

My initial thought would be to store the keyphrase as a keyword; however, I would then have to somehow group the keywords in my terms query in order to understand that all of the above examples are about the "exhibition".

I think another approach may be to store it as analyzed text and perform the terms aggregation on that, as the analyzer would strip out the words like "the".
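
To make that second idea concrete, here is a minimal Python sketch of what a stopword-removing analyzer followed by a terms aggregation would roughly do to the three example phrases. The `STOPWORDS` set and the `analyze` and `count_terms` helpers are hypothetical stand-ins written for illustration, not Elasticsearch APIs, and the real analyzer's behaviour depends on how it is configured.

```python
from collections import Counter

# Hypothetical stopword list; a real analyzer would use its own configured set.
STOPWORDS = {"the", "a", "an"}

def analyze(phrase):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in phrase.lower().split() if t not in STOPWORDS]

def count_terms(phrases):
    """Count each surviving token once per phrase (one phrase = one document)."""
    counts = Counter()
    for phrase in phrases:
        counts.update(set(analyze(phrase)))
    return counts

keyphrases = ["the exhibition", "the marketing exhibition", "exhibition"]
term_counts = count_terms(keyphrases)
print(term_counts.most_common())
```

All three phrases contribute to `exhibition`, but `marketing` also appears as its own bucket, which is the trade-off of aggregating on individual tokens rather than whole phrases.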

I'm not sure what the best practice would be for something like this. Any thoughts?

If I follow correctly, I think you're saying that you want to effectively treat these values as synonyms.

Stating my assumptions again, you have a tool that extracts structured information from unstructured data. This is done to normalize the various different ways of referring to the same concept. Typically this requires the introduction of a unique ID (consider Wikipedia articles - they might mention "JFK" or "35th US President" in the unstructured text but these are hyperlinked to a common structured string "John F. Kennedy - Wikipedia").

Structured data often leans towards what computers want rather than what humans want: computers want to match by unambiguous unique IDs (typically ID numbers, e.g. 7423649), whereas humans want something more readable, like a label. In Elasticsearch/Kibana I tend to try to combine the two by using keyword tokens that contain both an unambiguous unique ID and a label. Note that Wikipedia's URLs are an example of structured data that is both unique and human-readable.
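
As a rough illustration of the combined ID-and-label idea, the Python sketch below maps each extracted surface form to a single `ID|label` token that could be stored in a keyword field and counted. The `CONCEPTS` lookup, the `to_keyword` helper, the `|` separator and the ID value are all assumptions made up for this example; a real extraction tool would supply its own mapping.

```python
from collections import Counter

# Hypothetical lookup from extracted surface forms to (unique ID, readable label).
# The ID value is made up for illustration.
CONCEPTS = {
    "exhibition": ("7423649", "Exhibition"),
    "the exhibition": ("7423649", "Exhibition"),
    "the marketing exhibition": ("7423649", "Exhibition"),
}

def to_keyword(surface_form):
    """Return an 'ID|label' token, or None when the phrase is unknown."""
    entry = CONCEPTS.get(surface_form.lower().strip())
    if entry is None:
        return None
    concept_id, label = entry
    return f"{concept_id}|{label}"

extracted = ["the exhibition", "The Marketing Exhibition", "exhibition"]
tokens = [to_keyword(p) for p in extracted]

# Counting identical tokens mimics what a terms aggregation on a keyword field does.
topic_counts = Counter(t for t in tokens if t is not None)
print(topic_counts)
```

Because every variant collapses to the same token, the count groups them without any query-time logic, and the label half of the token keeps the bucket readable in a chart.
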
When you have a tool that extracts structured data from free text, it is often useful to record where in the text these discoveries were made, which is why we now have an annotated_text field type.
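
For a sense of what recording those positions might look like, here is a small Python sketch that turns character offsets into inline annotations. It assumes the markdown-like `[surface text](annotation)` syntax with URL-encoded annotation values, which is my understanding of what the annotated_text field accepts; please check the plugin documentation before relying on it. The `annotate` helper and the example spans are hypothetical.

```python
from urllib.parse import quote

def annotate(text, spans):
    """Wrap each (start, end, value) span as [surface](url-encoded value).

    Spans must not overlap; offsets are character positions in `text`.
    """
    out = []
    cursor = 0
    for start, end, value in sorted(spans):
        out.append(text[cursor:start])          # untouched text before the span
        out.append(f"[{text[start:end]}]({quote(value)})")
        cursor = end
    out.append(text[cursor:])                   # trailing text after the last span
    return "".join(out)

sentence = "We met at the marketing exhibition."
phrase = "the marketing exhibition"
start = sentence.index(phrase)
annotated = annotate(sentence, [(start, start + len(phrase), "exhibition")])
print(annotated)
```

The original wording stays searchable as text, while the value in parentheses carries the normalized concept for that exact position.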
