What are the most popular contextual terms (after/before) of an expression?

nicom · September 20, 2015, 1:39pm

Hi,

I have a text field.

I would like to get a list of the most popular contextual terms for an expression, e.g. "great house". By context I mean the terms that most often appear immediately before or after the expression in the corpus, e.g. xx great house xx.

If lots of documents contain "nice great house" in the text, then "nice" should be in that list.

How can I do this in ES? Is ES the right tool for that?

softwaredoug · September 21, 2015, 2:05am

Well one simple way, depending on the size of your data, is to create an index of bigrams by using a custom analyzer.

So for the input to analysis, you'd have

the great house at

and instead of breaking it up into words, you modify analysis to break it up into bigrams (two-word tokens) using the shingle filter, like

[the great] [great house] [house at]
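To make the shingling step concrete, here is a rough sketch in Python. The shingle() helper only imitates what the analyzer would emit, and the names bigram_filter and bigram_analyzer in the settings are made up for illustration; the choice of the standard tokenizer and lowercase filter is also an assumption, not something from the thread.

```python
def shingle(text, size=2):
    """Split text on whitespace and return word n-grams (shingles) of `size` words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Hypothetical index settings: filter and analyzer names are illustrative only.
# output_unigrams=False keeps single words out of the token stream, so the
# field holds bigrams and nothing else.
BIGRAM_SETTINGS = {
    "analysis": {
        "filter": {
            "bigram_filter": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": False,
            }
        },
        "analyzer": {
            "bigram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "bigram_filter"],
            }
        },
    }
}

print(shingle("the great house at"))
# ['the great', 'great house', 'house at']
```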

A prefix query on "house " (house followed by a space) here matches all the occurrences of house SPACE some word. Then simply do a terms aggregation on the same field, and you'll see the bigrams as buckets (facets), ordered by how many of the matching documents contain each one. You may need to filter the aggregation further so you don't see every other bigram in these documents.

"buckets" : [ 
    {
        "key" : "house rules",
        "doc_count" : 52
    },
    {
        "key" : "house sucks",
        "doc_count" : 42
    },
    ...
]
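As a sketch of what the prefix query plus terms aggregation amounts to, the Python below counts, for each bigram starting with a given word, how many documents contain it. The sample documents are invented. SEARCH_BODY shows roughly what the request might look like; the field name text.bigrams and the aggregation name are hypothetical, and depending on the ES version, aggregating on an analyzed text field may require fielddata to be enabled.

```python
from collections import Counter


def bigrams(text):
    """Return the two-word shingles of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def following_terms(docs, word):
    """For each bigram starting with `word`, count the documents that contain it."""
    prefix = word + " "
    counts = Counter()
    for doc in docs:
        # Use a set: doc_count counts documents, not occurrences.
        counts.update({b for b in bigrams(doc) if b.startswith(prefix)})
    return counts.most_common()


# Hypothetical request body; "text.bigrams" is an assumed field name.
# The "include" pattern restricts buckets to bigrams that start with "house ".
SEARCH_BODY = {
    "size": 0,
    "query": {"prefix": {"text.bigrams": "house "}},
    "aggs": {
        "following": {
            "terms": {"field": "text.bigrams", "include": "house .*", "size": 20}
        }
    },
}

docs = [
    "this house rules",
    "our house rules and their house rules",
    "that house sucks",
]
print(following_terms(docs, "house"))
# [('house rules', 2), ('house sucks', 1)]
```

The include pattern is one way to do the extra filtering: without it, the aggregation would also return every other bigram found in the matching documents.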

The OTHER direction is a bit trickier, though. You may need to duplicate your data to another field to get a different view. You can do wildcard "* house" queries, but leading wildcards don't perform that well. Instead, you need to reverse the tokens BEFORE you do the prefix query. So in a completely separate field, you want to add a reverse filter to reverse the text AFTER shingling.

So:

[good house]

becomes for examining the other direction:

[esuoh doog]

Then repeat the process for the other direction: run a prefix query on "esuoh " (with the trailing space), get the terms aggregation, and reverse the resulting terms yourself.
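The reversed-field trick can be sketched the same way in Python. This only simulates the idea: shingle first, reverse each shingle, match on the reversed word as a prefix, then reverse the matching terms back. The sample documents are invented and the function names are hypothetical.

```python
from collections import Counter


def reversed_bigrams(text):
    """Shingle into bigrams first, then reverse each shingle character by character."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2])[::-1] for i in range(len(tokens) - 1)]


def preceding_terms(docs, word):
    """Count documents per bigram that ENDS with `word`, via a prefix match on reversed text."""
    prefix = word[::-1] + " "  # "house" becomes the prefix "esuoh "
    counts = Counter()
    for doc in docs:
        matches = {b for b in reversed_bigrams(doc) if b.startswith(prefix)}
        # Reverse the matched terms back so the result is readable.
        counts.update(m[::-1] for m in matches)
    return counts.most_common()


docs = ["a good house", "one good house here", "the nice house"]
print(reversed_bigrams("good house"))
# ['esuoh doog']
print(preceding_terms(docs, "house"))
# [('good house', 2), ('nice house', 1)]
```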

Fun problem, hope that helps.

nicom · September 21, 2015, 5:18am

Great doug,

Thanks for the "before term" trick!
About the shingle filter: is there a way in ES to do skip-gram modeling?

softwaredoug · September 21, 2015, 4:03pm

Not that I know of. Probably not directly in ES, but that's not quite my bailiwick. My coauthor John Berryman, who's much more of a data scientist, would probably know better than I do. You might try pinging him?