How does shingle filter work on match_phrase in query phase?


(陳智清) #1

How does the shingle filter work with match_phrase in the query phase?

After analyzing the phrase "t1 t2 t3", the shingle filter produces five tokens:
t1
t2
t3
"t1 t2"
"t2 t3"

Will match_phrase still match "t1 t2 t3"? How does it work? Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/33889bbd-9b01-4414-b579-4e625f0eec17%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Cédric Hourcade) #2

Hello,

Let's say you have indexed the text "t1 t2 t3" with shingles. The token
positions are also indexed, so you get: t1 (at pos 1), "t1 t2" (pos
1), t2 (pos 2), "t2 t3" (pos 2) and t3 (pos 3).

So if you search with a match_phrase for "t1 t2 t3" (even if the query
is not tokenized into shingles), it will match the document, because t1,
t2 and t3 are considered next to each other (based on their recorded
positions) in this document.
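
To make the position argument concrete, here is a minimal Python sketch of the idea. It is not Lucene's implementation: `shingle_tokens`, `build_index` and `phrase_matches` are hypothetical helpers written for illustration, and positions start at 1 to follow the numbering above (Elasticsearch's own `_analyze` output counts from 0).

```python
from collections import defaultdict


def shingle_tokens(words):
    """Return (token, position) pairs: each word plus each two-word shingle.

    A shingle is recorded at the position of its first word.
    """
    tokens = []
    for pos, word in enumerate(words, start=1):
        tokens.append((word, pos))
        if pos < len(words):
            # words[pos] is the next word, because pos is 1-based
            tokens.append((f"{word} {words[pos]}", pos))
    return tokens


def build_index(tokens):
    """Map each token to the set of positions where it occurs."""
    index = defaultdict(set)
    for token, pos in tokens:
        index[token].add(pos)
    return index


def phrase_matches(index, query_tokens):
    """True if every query term occurs at the expected relative position."""
    first_term, first_pos = query_tokens[0]
    for start in index.get(first_term, ()):
        if all(
            start + (pos - first_pos) in index.get(term, ())
            for term, pos in query_tokens
        ):
            return True
    return False


doc_tokens = shingle_tokens(["t1", "t2", "t3"])
index = build_index(doc_tokens)

# A plain (non-shingled) phrase query: t1, t2, t3 at consecutive positions.
plain_query = [("t1", 1), ("t2", 2), ("t3", 3)]
print(phrase_matches(index, plain_query))  # True
```

The extra shingle tokens sit at the same positions as the words that start them, so they do not push t1, t2 and t3 apart, and the plain phrase still lines up.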

Cédric Hourcade
ced@wal.fr



(陳智清) #3

Hello Hourcade, thanks for your response.

Does that mean "index_analyzer" and "search_analyzer" should be set to
different values (e.g. "index_analyzer": "shingle" and
"search_analyzer": "standard")?
What if I want to reuse the same "shingle" analyzer for both indexing and
search? Will match_phrase "t1 t2 t3" still give me a match?

I know that setting a different analyzer as "search_analyzer" makes
match_phrase "t1 t2 t3" searchable, but if I do that, I get no benefit
from "shingle", right? I only get a bigger index.

I assumed "shingle" is used for faster "match_phrase" searches. But after
shingling, searching for a phrase of 3 tokens "t1 t2 t3" becomes searching
for a phrase of 5 tokens, and I don't know how "shingle" arranges the
positions for a correct phrase query. So how can "match_phrase" be faster?
Thank you.



(Cédric Hourcade) #4

Yes, you can use two different analyzers. In your case what you can do is:

  • for indexing, you apply a shingle filter.
  • for the query, you also apply a shingle filter, but this time you
    disable the unigrams (output_unigrams: false), so it will only
    generate the shingles, in your case: "t1 t2" and "t2 t3". It will
    match your document.

Cédric Hourcade
ced@wal.fr
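
The two-analyzer setup described above could be written roughly as follows. This is a sketch under assumptions, not a tested configuration: the filter, analyzer and field names (`shingle_index`, `shingle_search`, `shingle_index_analyzer`, `shingle_search_analyzer`, `body`) are made up for illustration, and the mapping uses the `analyzer` / `search_analyzer` keys of recent Elasticsearch versions rather than the `index_analyzer` key mentioned earlier in the thread. The Python code only builds and prints the JSON bodies; it does not talk to a cluster.

```python
import json

# Index-time filter emits unigrams and two-word shingles;
# the search-time filter emits shingles only.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "shingle_index": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": True,
                },
                "shingle_search": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,
                },
            },
            "analyzer": {
                "shingle_index_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_index"],
                },
                "shingle_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_search"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "shingle_index_analyzer",
                "search_analyzer": "shingle_search_analyzer",
            }
        }
    },
}

# The phrase query itself is unchanged; the field's search analyzer
# turns "t1 t2 t3" into the shingles "t1 t2" and "t2 t3".
query_body = {"query": {"match_phrase": {"body": "t1 t2 t3"}}}

print(json.dumps(index_body, indent=2))
print(json.dumps(query_body, indent=2))
```

One thing to watch with this design: with unigrams disabled, a single-word query produces no shingles at all. The shingle filter has an `output_unigrams_if_no_shingles` option that is meant for that case.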



(陳智清) #5

I got it! Thank you!!!!!


