Hello all,
I'm trying to implement a tag cloud for phrases, with the ability to remove stopwords from them. I've managed to build a single-word tag cloud but want to expand this to phrases, or shingles in this case.
My mapping and query can be found here:
The idea is to break sentences down into two-word phrases (limited to two for this example) and then find the top x phrases within a given dataset. I also want stemming to remove plurals etc., but I want to return the whole word and not its stemmed form. This I can do using the kstem filter.
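To make the idea concrete outside of Elasticsearch, here is a minimal Python sketch of what I mean by two-word shingles and "top x phrases". The names `shingles` and `top_phrases` are made up for illustration, tokenizing on whitespace is a simplification, stemming is not modeled, and counting each term once per document is my assumption about how the counts should behave.

```python
from collections import Counter


def shingles(text, size=2):
    """Return the single words plus the word n-grams of the given size."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    return words + grams


def top_phrases(docs, x=10):
    """Return the x most common terms, counting each term once per document."""
    counts = Counter()
    for doc in docs:
        # set() ensures a term repeated inside one document counts only once
        counts.update(set(shingles(doc)))
    return counts.most_common(x)
```

For example, `shingles("no way")` yields the unigrams `no` and `way` plus the shingle `no way`, which is the mix of single words and pairs I expect the facet to return.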
The problem is that I do get the top x shingles, but some of them contain stopwords, which I thought would be removed either by the stop filter or by the exclude terms in my terms facet. The result of the query is below. Even though I have 'i' in the exclude list, it is still returned as part of 'and i', which is undesirable. 'a problem' is also undesirable, as 'problem' is already in the list. I can remove them if I specify 'and i' and 'a problem' in the exclude list, but it is not practical to put every combination into the exclude list.
Is it possible to remove stopwords from shingles in a terms facet?
{
  "took" : 214,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "blah" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 21,
      "other" : 4,
      "terms" : [ {
        "term" : "problem",
        "count" : 3
      }, {
        "term" : "and i",
        "count" : 3
      }, {
        "term" : "way",
        "count" : 2
      }, {
        "term" : "no way",
        "count" : 2
      }, {
        "term" : "a problem",
        "count" : 2
      }, {
        "term" : "test message",
        "count" : 1
      }, {
        "term" : "test",
        "count" : 1
      }, {
        "term" : "my problem",
        "count" : 1
      }, {
        "term" : "my",
        "count" : 1
      }, {
        "term" : "message",
        "count" : 1
      } ]
    }
  }
}
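
For reference, the behaviour I am after could be approximated on the client by dropping every returned term in which any word is a stopword. This is only a sketch of the desired result, not something Elasticsearch does for me: `drop_stopword_terms` and the `STOPWORDS` set are hypothetical, and the input is assumed to be the `terms` array from the facet response above.

```python
# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "and", "i"}


def drop_stopword_terms(facet_terms, stopwords=STOPWORDS):
    """Keep only the facet entries whose term contains no stopword."""
    return [
        entry for entry in facet_terms
        if not any(word in stopwords for word in entry["term"].split())
    ]


terms = [
    {"term": "problem", "count": 3},
    {"term": "and i", "count": 3},
    {"term": "a problem", "count": 2},
    {"term": "no way", "count": 2},
]
kept = drop_stopword_terms(terms)
```

The drawback is that filtering after the fact shrinks the list, so I would have to request more than x terms and trim afterwards, which is why I would prefer to do this in the analysis chain or in the facet itself.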