Confused about Shingle behavior

On Monday, November 12, 2012 11:56:52 PM UTC-5, Zachary Tong wrote:

I'm really struggling to get proper shingle searching to work. I've tried
dozens of variations using text, query_string, bool, and dis_max queries:
the whole works. I simply cannot get it to function the way I want. I imagine
I'm doing something fundamentally wrong, since this seems like it should be
easy. My mapping looks like this: https://gist.github.com/4063964

Basically, I'm indexing a field with a normal analyzer as well as a
shingle analyzer. For search, I want to match exact phrases first, then
match shingled phrases next (i.e. partial phrases). I'm searching for
"Great Planes Rotor Blade" using the following query:

{
  "explain": true,
  "size": 5,
  "from": 0,
  "highlight": {
    "pre_tags": [""],
    "post_tags": [""],
    "fields": {
      "body": {}
    }
  },
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "queries": [
        {
          "text": {
            "body": {
              "query": "Great Planes Rotor Blade",
              "type": "phrase"
            }
          }
        },
        {
          "query_string": {
            "fields": ["body"],
            "query": "Great Planes Rotor Blade",
            "phrase_slop": 0,
            "minimum_should_match": "40%"
          }
        },
        {
          "query_string": {
            "fields": ["body"],
            "query": "Great Planes Rotor Blade",
            "phrase_slop": 0,
            "minimum_should_match": "40%",
            "analyzer": "analyzer_partial_shingle"
          }
        }
      ]
    }
  }
}

Unfortunately, I'm getting results all over the place. Some items which
use the word "blade" 4-5 times will rank higher than items that use the
phrase "Great Planes" once. I assumed that shingling the query (using
analyzer_partial_shingle) and then searching the indexed shingles would
find "Great Planes" and increase the score, but it doesn't seem to be
working that way.

Can anyone shed some light on what I'm doing wrong?
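
For readers following along, a rough sketch of how dis_max combines clause scores may help explain the ranking described above. It assumes the usual disjunction-max rule (the best clause score plus tie_breaker times the scores of the other matching clauses). The function name is hypothetical and the clause scores are invented for illustration; they are not taken from any real explain output.

```python
def dis_max_score(clause_scores, tie_breaker):
    """Best clause score plus tie_breaker times the other matching clause scores."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Made-up scores in the order [phrase, query_string, query_string + shingle analyzer].
# Neither document contains the full phrase, so the first clause scores 0 for both.
great_planes_once = [0.0, 0.4, 0.4]
blade_five_times = [0.0, 0.9, 0.9]

for name, scores in [("great planes once", great_planes_once),
                     ("blade five times", blade_five_times)]:
    print(name, round(dis_max_score(scores, tie_breaker=0.7), 2))
```

With a tie_breaker as high as 0.7, every matching clause adds most of its score to the total, so a document that scores well on plain term frequency in two clauses can outrank one that contains a partial phrase once.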

--

On Tuesday, November 13, 2012 8:42:27 AM UTC-5, Zachary Tong wrote:

As a followup, I've been toying with the bare minimum required to get
shingles working. This is my current query iteration, but it doesn't
return any results and I'm unsure why:

{
  "explain": true,
  "size": 5,
  "from": 0,
  "highlight": {
    "pre_tags": ["<span class=\"highlight\">"],
    "post_tags": ["</span>"],
    "fields": {
      "body": {}
    }
  },
  "query": {
    "text_phrase": {
      "body.partial_shingle": {
        "query": "Great Planes Rotor Blade",
        "analyzer": "analyzer_partial_shingle"
      }
    }
  }
}

If I'm understanding correctly, this text_phrase query should:

  1. Break the query "Great Planes Rotor Blade" into two bi-grams using
    analyzer_partial_shingle: ["Great Planes", "Planes Rotor"]. I've set
    unigrams = false, so only these two bigrams should be produced.
  2. text_phrase will take these two tokens and perform an exact
    phrase match against body.partial_shingle.
  3. Any document with "Great Planes" as a token should be found.
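
The tokenization in step 1 can be checked with a toy shingler. This is only a sketch, not the Lucene shingle filter: whitespace splitting, lowercasing, and the function name are assumptions. It suggests that a four-word query yields three bigrams rather than two, and that allowing shingles of up to four words (an assumption inferred from the validate output later in this thread) yields six tokens.

```python
def shingles(text, min_size=2, max_size=2):
    """Return lowercased word n-grams of min_size..max_size words, without unigrams."""
    words = text.lower().split()
    result = []
    for start in range(len(words)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


# Bigrams only, as described in step 1.
bigrams = shingles("Great Planes Rotor Blade")
# Shingles of two to four words.
up_to_four = shingles("Great Planes Rotor Blade", min_size=2, max_size=4)

print(bigrams)
print(up_to_four)
```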

Clearly I'm missing something, since this is not happening. Any tips?

Thanks!
-Zach

--

On Tuesday, November 13, 2012 9:11:17 AM UTC-5, Igor Motov wrote:

A couple of notes. 1) You are using analyzer_term as the search analyzer
for body.partial_shingle, so bi-grams don't actually happen on the query
side. 2) I'm not completely sure about your use case, but since you are
using shingles, wouldn't it make more sense to just use a text query instead
of text_phrase, since shingles already handle the "phrase" aspect?
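
The second note can be illustrated with a toy model in which a plain text query on the shingle field behaves like an OR over the query's shingles. This is a sketch under that assumption, not Elasticsearch's implementation; the helper names and sample documents are hypothetical.

```python
def bigrams(text):
    """Lowercased two-word shingles; a stand-in for the real shingle filter."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def shares_a_shingle(query, document):
    """OR semantics: one shared shingle is enough for a match."""
    return bool(set(bigrams(query)) & set(bigrams(document)))


query = "Great Planes Rotor Blade"
print(shares_a_shingle(query, "Spare parts for Great Planes models"))
print(shares_a_shingle(query, "Blade, blade and more blade"))
```

Because each shingle already encodes word adjacency, a document repeating "blade" shares no token with the query in this model, while a document containing "Great Planes" does.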

--

On Tue, Nov 13, 2012 at 9:14 AM, Zachary Tong wrote:

Does specifying the analyzer in the query not affect which search analyzer
is used? If it doesn't, that's probably my problem!

I'll try the text query instead. To be honest, I'm kinda flailing around.
I'm not entirely certain how different queries work, so it's a lot of
trial and error =)

--

Igor Motov wrote:

You are right, the analyzer specified on the query should be applied. I
just didn't notice it:

$ curl "localhost:9200/shingles/_validate/query?pretty=true&explain=true" -d '{
  "text": {
    "body.partial_shingle": {
      "query": "Great Planes Rotor Blade",
      "analyzer": "analyzer_partial_shingle"
    }
  }
}'

{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [ {
    "index" : "shingles",
    "valid" : true,
    "explanation" : "body.partial_shingle:great planes body.partial_shingle:great planes rotor body.partial_shingle:great planes rotor blade body.partial_shingle:planes rotor body.partial_shingle:planes rotor blade body.partial_shingle:rotor blade"
  } ]
}

By the way, query_string wouldn't work in your case because it splits the
query into terms on whitespace during the query-parsing phase, before the
analyzer can actually get to the terms.
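
A small sketch of why splitting first defeats shingling: a shingle needs at least two adjacent words, and a term that has already been cut at whitespace is a single word. The helper names are hypothetical, and the bigram function only stands in for the real analyzer.

```python
def bigrams(text):
    """Lowercased two-word shingles; a stand-in for the real shingle filter."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def analyze_whole_query(query):
    """The analyzer sees the full string, so adjacent words can be paired."""
    return bigrams(query)


def analyze_after_whitespace_split(query):
    """Each whitespace-separated term is analyzed alone, so there is nothing to pair."""
    return [shingle for term in query.split() for shingle in bigrams(term)]


query = "Great Planes Rotor Blade"
print(analyze_whole_query(query))
print(analyze_after_whitespace_split(query))
```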

--

On Tuesday, November 13, 2012 1:10:47 PM UTC-5, Zachary Tong wrote:

Thanks for the help Igor, switching the query over to a "text" query did
the trick. Good to know about query_string; I didn't realize it tokenized
before analyzing.

I'm not sure I understand why text_phrase didn't work, however. It doesn't
really matter now, but I'm curious so I can avoid that mistake in the
future. Why do individual tokens match the shingle, but not a full phrase?

-Zach

--

On Tuesday, November 13, 2012 5:00:11 PM UTC-5, Igor Motov wrote:

Yeah, not sure what's wrong with match_phrase. It works for me
though: https://gist.github.com/73fcd9f3e19165802e14

--

Zachary Tong wrote:

Weird. I am running a few versions behind (0.19.8), where text hasn't
been replaced with match yet, so perhaps it's due to that. Or something
else that I have wrong somewhere else.

Either way, thanks for the help, really appreciate it =)

--