ShingleFilter with Stop List for Tag Cloud


(uzicorp) #1

Hello all,

I'm trying to implement a tag cloud for phrases, with the ability to remove stopwords from them. I've managed to build a single-word tag cloud but want to expand this to phrases, or shingles in this case.

My mapping and query can be found here:

https://gist.github.com/2422090

The idea is to break sentences down into two-word phrases (limited to two for this example) and then find the top x phrases within a given dataset. I also want stemming to remove plurals etc., but I want to return the whole word and not its stemmed form. This I can do using the kstem filter.

So the problem here is that I get the top x shingles, but some of them contain stopwords, which I thought would be removed either by the stop filter or by the exclude terms in my terms facet. The result of the query is below. Even though I have 'i' in the exclude list, it is still returned as part of 'and i', which is undesirable. 'a problem' is also undesirable, as 'problem' is already in the terms list. I can remove them if I specify 'and I' and 'a problem' in the exclude list, but it is not efficient to put every combination into the exclude list.

Is it possible to remove stopwords from shingles in a terms facet?

{
  "took" : 214,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "blah" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 21,
      "other" : 4,
      "terms" : [ {
        "term" : "problem",
        "count" : 3
      }, {
        "term" : "and i",
        "count" : 3
      }, {
        "term" : "way",
        "count" : 2
      }, {
        "term" : "no way",
        "count" : 2
      }, {
        "term" : "a problem",
        "count" : 2
      }, {
        "term" : "test message",
        "count" : 1
      }, {
        "term" : "test",
        "count" : 1
      }, {
        "term" : "my problem",
        "count" : 1
      }, {
        "term" : "my",
        "count" : 1
      }, {
        "term" : "message",
        "count" : 1
      } ]
    }
  }
}
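
A rough way to see why 'and i' survives: a stop filter and the facet's exclude list both compare whole terms, and once the shingle filter has run, 'and i' is a single term that matches neither 'and' nor 'i'. The Python sketch below is only a toy model of that behaviour, not Elasticsearch code. It assumes the stop filter ran after the shingle filter in the original mapping, and the stop list and the helper names `shingle` and `drop_stopwords` are made up for illustration.

```python
# Toy model of "shingle first, then remove stop words"; not Elasticsearch code.
STOPWORDS = {"a", "and", "i", "no"}  # hypothetical stop / exclude list


def shingle(tokens, size=2):
    """Emit each single token, followed by the word n-gram starting at it."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + size <= len(tokens):
            out.append(" ".join(tokens[i:i + size]))
    return out


def drop_stopwords(terms):
    """Remove only those terms that are exactly equal to a stop word."""
    return [t for t in terms if t not in STOPWORDS]


tokens = "and i have a problem".split()
terms = drop_stopwords(shingle(tokens))
print(terms)
```

In this model the single terms 'and', 'i' and 'a' are dropped, but 'and i' and 'a problem' remain, which matches the facet output above.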


(Shay Banon) #2

You can put the stop filter before the shingle filter. It will then remove the
stop words before the shingles are created.
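
To illustrate why the order of the filters matters, here is a small Python sketch that runs the same two steps in both orders. It is a simplified model under stated assumptions, not the Lucene implementation: the stop list, the helper names `stop`, `shingle` and `analyze`, and the whitespace tokenizer are all hypothetical, and the model ignores token positions (a real stop filter can leave a position gap that the shingle filter then has to handle).

```python
# Simplified model of an analysis chain; filters are applied in list order.
STOPWORDS = {"a", "and", "no"}  # hypothetical stop list


def stop(tokens):
    """Drop tokens that are exactly equal to a stop word."""
    return [t for t in tokens if t not in STOPWORDS]


def shingle(tokens, size=2):
    """Emit each single token, followed by the word n-gram starting at it."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + size <= len(tokens):
            out.append(" ".join(tokens[i:i + size]))
    return out


def analyze(text, filters):
    """Lowercase, split on whitespace, then apply each filter in order."""
    tokens = text.lower().split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


shingle_first = analyze("A test message", [shingle, stop])
stop_first = analyze("A test message", [stop, shingle])
print(shingle_first)
print(stop_first)
```

With the shingle step first, 'a test' is still in the output because it is no longer equal to any stop word. With the stop step first, 'a' never reaches the shingle step, so only 'test', 'test message' and 'message' are produced.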

On Thu, Apr 19, 2012 at 7:33 PM, uzicorp uzicorp@gmail.com wrote:



(uzicorp) #3

Hello and thanks for looking at this.

I tried putting the stop filter before the shingle filter but the results
looked erroneous:

This is the analyzer section in the mapping:

"analyzer":{
  "myAnalyzer":{
    "type":"custom",
    "tokenizer":"standard",
    "filter":[
      "lowercase",
      "stop",
      "myCustomShingle",
      "kstem"
    ]
  }
}

And the result is:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "blah" : {
      "_type" : "terms",
      "missing" : 1,
      "total" : 20,
      "other" : 3,
      "terms" : [ {
        "term" : "problem",
        "count" : 3
      }, {
        "term" : "_ i",
        "count" : 3
      }, {
        "term" : "way",
        "count" : 2
      }, {
        "term" : "_ way",
        "count" : 2
      }, {
        "term" : "_ problem",
        "count" : 2
      }, {
        "term" : "test message",
        "count" : 1
      }, {
        "term" : "test",
        "count" : 1
      }, {
        "term" : "my problem",
        "count" : 1
      }, {
        "term" : "my",
        "count" : 1
      }, {
        "term" : "message",
        "count" : 1
      } ]
    }
  }
}

Notice the underscores it has produced and that some of the stop words are
still present in the results. I still have 'i' in the exclude list in the
facet query.

Any tips on why that is happening or how to configure this properly would
be much appreciated.
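
For readers hitting the same output: Lucene's ShingleFilter handles position gaps by inserting a filler token, "_" by default, so a stop word removed before shingling shows up as an underscore inside the shingle. Terms such as 'i' and 'my' would remain if they are not in the stop list the `stop` filter is using. The Python sketch below is a toy model of that behaviour, not the Lucene code. The stop list, the helper names, and the rule of skipping windows that contain no real token are assumptions made for illustration.

```python
# Toy model: the stop step keeps the gap, the shingle step fills it with "_".
STOPWORDS = {"a", "and", "no"}  # hypothetical stop list; "i" is not in it
FILLER = "_"


def stop_keep_positions(tokens):
    """Replace stop words with None so later steps still see the gap."""
    return [None if t in STOPWORDS else t for t in tokens]


def shingle_with_filler(slots, size=2):
    """Emit real tokens plus n-grams, writing FILLER where a gap was left."""
    out = []
    for i, tok in enumerate(slots):
        if tok is not None:
            out.append(tok)
        window = slots[i:i + size]
        # Assumption of this model: skip windows made of gaps only.
        if len(window) == size and any(t is not None for t in window):
            out.append(" ".join(FILLER if t is None else t for t in window))
    return out


slots = stop_keep_positions("and i have a problem".split())
terms = shingle_with_filler(slots)
# One possible client-side clean-up: drop any term containing the filler.
clean = [t for t in terms if FILLER not in t.split()]
print(terms)
print(clean)
```

The model reproduces terms like '_ i' and '_ problem' from the result above. Filtering out terms that contain the filler is one possible workaround on the client side. Later Elasticsearch releases also document a `filler_token` setting on the shingle filter; whether it is available depends on the version in use.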

On 21 April 2012 15:53, kimchy [via ElasticSearch Users] <
ml-node+s115913n3928262h32@n3.nabble.com> wrote:


--
Kind regards,

Usman


(system) #4