What are the most popular contextual terms (after/before) of an expression?


I have a text field.

I would like to get a list of the most popular contextual terms for an expression, e.g. "great house". By context I mean the terms that most often appear immediately before or after the expression in the corpus, e.g. xx great house xx.

If lots of documents contain "nice great house" in the text, then "nice" should be in that list.

How can I do this in ES? Is ES even the right tool for that?

Well, one simple way, depending on the size of your data, is to create an index of bigrams using a custom analyzer.

So for the input to analysis, you'd have

the great house at

and instead of breaking it up into words, you modify the analysis to break it up into bigrams (two-word tokens) using the shingle filter, like

[the great] [great house] [house at]
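For what it's worth, here is a rough sketch of what that analysis setup might look like, written as a Python dict you could send as the index-creation body. The names bigram_filter, bigram_analyzer and text_bigrams are made up for illustration, and the mapping assumes a recent ES version with typeless mappings (older versions nest this under a type name). fielddata is switched on because terms aggregations on a text field need it in recent versions; that can be memory hungry, so treat it as an assumption to revisit. The bigrams() helper is only a local approximation of the analyzer for sanity checking, not what ES runs.

```python
# Hypothetical index body: a custom analyzer that emits only two-word shingles.
BIGRAM_INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "bigram_filter": {          # made-up filter name
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,  # keep bigrams only
                }
            },
            "analyzer": {
                "bigram_analyzer": {        # made-up analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "bigram_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text_bigrams": {               # made-up field name
                "type": "text",
                "analyzer": "bigram_analyzer",
                "fielddata": True,          # assumption: needed to aggregate on text
            }
        }
    },
}


def bigrams(text):
    """Approximate the analyzer locally: lowercase, split on whitespace, pair up."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


print(bigrams("the great house at"))
```

Running the helper on the example input gives the same three bigrams listed above, which is a cheap way to check your expectations before reindexing anything.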

A prefix query on house * (that is, house followed by a space) here matches every occurrence of house SPACE some word. Then simply run a terms aggregation, and you'll see the bigrams ordered by how frequently they occur in the search results. You may need to filter this further so you don't see every bigram in these documents.

"buckets" : [
    {
        "key" : "house rules",
        "doc_count" : 52
    },
    {
        "key" : "house sucks",
        "doc_count" : 42
    }
]

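A request that produces buckets like these could be sketched roughly as follows. This is only a sketch under assumptions: the field name text_bigrams and the aggregation name following_terms are invented, the field is assumed to be indexed as lowercase bigrams and to allow aggregation, and the word is assumed to contain no regex special characters, since it is pasted straight into the include pattern. The include pattern is one way to do the "further filter" step, so the aggregation only returns bigrams that start with the word.

```python
def after_terms_request(word, field="text_bigrams", size=10):
    """Build a search body listing the most common bigrams that start with `word`.

    `field` is a hypothetical bigram field; `word` must not contain regex
    special characters because it is used verbatim in the include pattern.
    """
    prefix = word.lower() + " "  # trailing space: match "house X", not "houses"
    return {
        "size": 0,  # we only want the aggregation, not the hits
        "query": {"prefix": {field: {"value": prefix}}},
        "aggs": {
            "following_terms": {
                "terms": {
                    "field": field,
                    "include": prefix + ".*",  # drop the other bigrams in matching docs
                    "size": size,
                }
            }
        },
    }


body = after_terms_request("house")
print(body["query"])
```

You would pass the resulting dict as the body of a search request with whatever client you use; the buckets then come back under aggregations.following_terms.
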
The OTHER direction, though, is a bit trickier. You may need to duplicate your data into another field to get a different view. You can do wildcard * house queries, but they don't perform that well. Instead, you need to reverse the tokens BEFORE you do the prefix query. So in a completely separate field, you want to add a reverse filter to reverse the text AFTER shingling.


[good house]

becomes for examining the other direction:

[esuoh doog]

Then repeat the process for the other direction with an esuoh * query :smile: and you get terms aggregations that you'll have to reverse yourself :slight_smile:
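
The reversing part is easy to get backwards (pun intended), so here is a small sketch of the client-side pieces. The analyzer dict assumes a 2-word shingle filter called bigram_filter is already defined in your index settings; that name, like the helper names below, is hypothetical. The reverse filter is listed last so that each whole shingle is reversed after shingling.

```python
# Hypothetical analyzer for the second field: shingle first, then reverse.
# "bigram_filter" is assumed to be a 2-word shingle filter defined elsewhere.
REVERSED_BIGRAM_ANALYZER = {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "bigram_filter", "reverse"],
}


def before_terms_prefix(word):
    """Prefix to query on the reversed field: the reversed word plus a space."""
    return word.lower()[::-1] + " "


def unreverse(key):
    """Turn a reversed bucket key such as 'esuoh doog' back into 'good house'."""
    return key[::-1]


def decode_buckets(buckets):
    """Un-reverse the keys of terms-aggregation buckets, keeping the counts."""
    return [
        {"key": unreverse(b["key"]), "doc_count": b["doc_count"]}
        for b in buckets
    ]


print(before_terms_prefix("house"))
print(unreverse("esuoh doog"))
```

Reversing the whole key character by character restores both the letters and the word order in one step, so no per-word handling is needed.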

Fun problem, hope that helps.

Great, Doug,

  1. thanks for the "before term" trick!

  2. about the shingle filter, is there a way in ES to do skip-gram modeling?

Not that I know of. Probably not directly in ES, but that's not quite my bailiwick. My coauthor John Berryman, who's much more of a data scientist, would probably know better than I do. You might try pinging him.