Shingle filter for sub phrase matching


(nicktackes) #1

I have created a gist with an analyzer that uses a shingle filter in an
attempt to match sub phrases.

For instance I have entries in the table with discrete phrases like

EGFR
Lung Cancer
Lung
Cancer

and I want to match these when searching the phrase 'EGFR related lung
cancer'.

My expectation is that the multi-word matches score higher than the
single-word matches, for instance:

  1. Lung Cancer
  2. Lung
  3. Cancer
  4. EGFR

Additionally, I tried a standard analyzer match but this didn't yield the
desired result either. One complicating aspect to this approach is that the
min_shingle_size has to be 2 or more.

How then would I be able to match single words like 'EGFR' or 'Lung'?

thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/041756c9-39b0-43df-a309-518b8dcb4326%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(nicktackes) #2

#create a test index with shingle mapping
curl -XPUT localhost:9200/test -d '{
  "settings":{
    "index":{
      "analysis":{
        "analyzer":{
          "analyzer_shingle":{
            "tokenizer":"standard",
            "filter":["standard", "lowercase", "filter_stop", "filter_shingle"]
          }
        },
        "filter":{
          "filter_shingle":{
            "type":"shingle",
            "max_shingle_size":5,
            "min_shingle_size":2,
            "output_unigrams":"true"
          },
          "filter_stop":{
            "type":"stop",
            "stopwords":[
              "a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "if", "in", "into", "is", "it",
              "no", "not", "of", "on", "or", "such",
              "that", "the", "their", "then", "there", "these",
              "they", "this", "to", "will", "with"
            ]
          }
        }
      }
    }
  },
  "mappings":{
    "product":{
      "properties":{
        "title":{
          "search_analyzer":"analyzer_shingle",
          "index_analyzer":"analyzer_shingle",
          "type":"string"
        }
      }
    }
  }
}'

#Add some docs to the index
curl -XPOST localhost:9200/test/product/1 -d '{"title" : "EGFR"}'
curl -XPOST localhost:9200/test/product/1 -d '{"title" : "WAS"}'
curl -XPOST localhost:9200/test/product/2 -d '{"title" : "Lung Cancer"}'
curl -XPOST localhost:9200/test/product/3 -d '{"title" : "Lung"}'
curl -XPOST localhost:9200/test/product/3 -d '{"title" : "Cancer"}'

curl -XPOST localhost:9200/test/_refresh

#Analyze API to check out shingling
curl -XGET 'localhost:9200/test/_analyze?analyzer=analyzer_shingle&pretty' -d 'EGFR and WAS Lung Cancer' | grep token

#Sample search should return EGFR, Lung Cancer, Lung, Cancer
curl -XGET 'localhost:9200/test/product/_search?q=title:EGFR+Lung+Cancer&pretty'

#Sample search with stop word should return EGFR, WAS, Lung Cancer, Lung, Cancer
curl -XGET 'localhost:9200/test/product/_search?q=title:EGFR+and+WAS+Lung+Cancer&pretty'

#Sample search with separating word should return EGFR, Lung Cancer, Lung, Cancer
curl -XGET 'localhost:9200/test/product/_search?q=title:EGFR+and+Lung+related+Cancer&pretty'

#Same search as a match query with the standard analyzer; should return EGFR, Lung Cancer, Lung, Cancer
curl -XGET localhost:9200/test/product/_search?pretty -d '{
  "query" : {
    "match" : {
      "title" : {
        "query" : "EGFR and Lung related Cancer",
        "analyzer":"standard"
      }
    }
  }
}'

curl -X DELETE localhost:9200/test

On Wednesday, July 23, 2014 9:37:03 AM UTC-5, Nick Tackes wrote:

https://gist.github.com/nicktackes/ffdbf22aba393efc2169.js



(Brian Keith) #3

Did you ever figure this out? I have exactly the same issue, just with
different words.



(David Kemp) #4

There are a couple of problems with the example code:

  • the second document posted with id '1' (WAS) replaces the first document with id '1' (EGFR), so you lose the EGFR document. The same happens with the two posts to id '3': Cancer replaces Lung. That is probably why Nick could not find EGFR or Lung. He should have been able to find "Cancer", especially since he set output_unigrams to true.
  • by default, scoring uses term statistics computed per shard, so the scores (and hence the sort order) will be meaningless with such a small number of documents spread over several shards. If you set number_of_shards to 1 you will get a more sensible sort order.
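
A rough sketch of both fixes, in case it helps. The names ES_URL, doc_body, create_index and index_titles are made up for this example and are not part of the original gist. The index body below only carries the shard setting; I am assuming the analysis and mappings sections from the original script would be merged into it.

```shell
#!/bin/sh
# Sketch only: ES_URL, doc_body, create_index and index_titles are
# hypothetical names, not part of the original gist.
ES_URL="${ES_URL:-localhost:9200}"

# Build the JSON body for a single document.
doc_body() {
  printf '{"title" : "%s"}' "$1"
}

# Create the index with a single shard so that term statistics, and
# therefore scores, are computed over all documents together.
# The analysis and mappings sections of the original script are assumed
# to be merged into this body; they are left out here for brevity.
create_index() {
  curl -s -XPUT "$ES_URL/test" -d '{
    "settings": { "index": { "number_of_shards": 1 } }
  }'
}

# Index every title under its own id (1, 2, 3, ...) so that no document
# replaces an earlier one.
index_titles() {
  doc_id=1
  for title in "$@"; do
    curl -s -XPOST "$ES_URL/test/product/$doc_id" -d "$(doc_body "$title")"
    doc_id=$((doc_id + 1))
  done
}
```

With a node listening on localhost:9200, calling create_index, then index_titles "EGFR" "WAS" "Lung Cancer" "Lung" "Cancer", then the _refresh call from the original script should leave five separate documents in the index. If changing the shard count is not an option, adding search_type=dfs_query_then_fetch to the search request is another way to have scores use index-wide term statistics.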

Cheers,
David


