Searching _all field for joined and separate words using shingle

I have an index analyser setup that uses the following shingle:

filter_shingle: {
  type: "shingle",
  min_shingle_size: 2,
  max_shingle_size: 2,
  output_unigrams: true,
  token_separator: ''
}

This means that "foo bar" will be indexed as "foo", "bar" and "foobar".

So if an entry contains "foo bar", the user can search using either
"foo bar" or "foobar" and it will match. That helps because users
sometimes don't know whether what they are searching for is one word or
two.
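
To make the indexing behaviour concrete, here is a rough Python imitation of what the filter emits. It is only a sketch: the function name `shingle` and its parameters are made up for illustration, it works on a plain list of words, and it is not how Elasticsearch implements the filter.

```python
def shingle(tokens, min_size=2, max_size=2, output_unigrams=True, separator=""):
    """Rough imitation of a shingle token filter over a list of word tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)  # keep the original word
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                # join `size` adjacent words using the separator ('' here)
                out.append(separator.join(tokens[i:i + size]))
    return out


print(shingle(["foo", "bar"]))  # ['foo', 'foobar', 'bar']
print(shingle(["foobar"]))      # ['foobar']
```

The second call shows the gap described next: an entry stored as "foobar" never produces "foo" or "bar".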

However, if an entry contains "foobar" which would just be indexed as
"foobar" then a user would have to search for it using "foobar" since
searching for "foo bar" would not return any matches in this case. This is
a problem if the info was posted to the database as one word when really
most people search for it as two words.

One option to get around this was to apply ngrams (e.g. min 3, max 15)
after the shingle filter, so that each shingle is split into ngrams. Among
all the other ngrams, "foo" and "bar" would then be indexed, and the user
would be able to find "foobar" using either "foobar" or "foo bar".
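
As a rough illustration of that trade-off, the sketch below lists the ngrams of a single token in plain Python. The helper `ngrams` is hypothetical and only approximates an ngram filter with min 3 and max 15; the real filter may emit the grams in a different order.

```python
def ngrams(token, min_gram=3, max_gram=15):
    """All substrings of `token` whose length is between min_gram and max_gram."""
    out = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            out.append(token[start:start + size])
    return out


grams = ngrams("foobar")
print(grams)
print(len(grams), "terms for one 6-letter token")
```

"foo" and "bar" are in the list, but so are fragments such as "oob" and "oba", and the number of terms grows quickly with token length.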

One of the fields that my _all field relies on isn't a small field so
applying ngrams isn't appropriate and will bloat the index for very little
gain and possibly reduce the accuracy of the results.

I tried applying the shingle filter to the search analyser as well, so that
the search terms were joined. That didn't work: "foo bar" was joined to make
"foobar", but the query also searched for "foo" and "bar" separately, and I
really want an exact match.

I could add a second search option (another "should { .... }") so that one
would look for joined words and the other would look for separate words but
that only works if the user is limited to searching with either 1 or 2
search words. If they type 3 search words the idea fails.

What I'm suggesting is a way, when searching a shingled field, for a
matching shingle to also count as a match for the unigrams it was built
from. This could be a new boolean attribute on the shingle token filter,
applied to the search analyser. If set to true, a search for "foo bar"
would match "foobar" and, as a consequence, the "foo" and "bar" that were
joined to make "foobar" would be considered matched as well.
I'm not sure, but could this be done using the start_offset and end_offset
values associated with each token?
If "foo bar" is analysed it is split into the following:

{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "",
    "position" : 1
  }, {
    "token" : "foobar",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "shingle",
    "position" : 1
  }, {
    "token" : "bar",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "",
    "position" : 2
  } ]
}

So if "foobar" (0-7) matches, could the start_offset and end_offset not be
used to qualify "foo" (0-3) and "bar" (4-7) as matches too?
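
The offset idea can be sketched outside Elasticsearch like this. It is an assumption-laden illustration, not an existing filter option: `covered_unigrams` is a hypothetical helper, and `matched_shingles` stands in for whatever shingles the index lookup reported as hits.

```python
# Token stream for "foo bar", trimmed to the fields the check needs.
tokens = [
    {"token": "foo", "start_offset": 0, "end_offset": 3, "type": ""},
    {"token": "foobar", "start_offset": 0, "end_offset": 7, "type": "shingle"},
    {"token": "bar", "start_offset": 4, "end_offset": 7, "type": ""},
]


def covered_unigrams(tokens, matched_shingles):
    """Unigrams whose offsets fall inside the span of a matched shingle."""
    spans = [
        (t["start_offset"], t["end_offset"])
        for t in tokens
        if t["type"] == "shingle" and t["token"] in matched_shingles
    ]
    return [
        t["token"]
        for t in tokens
        if t["type"] != "shingle"
        and any(s <= t["start_offset"] and t["end_offset"] <= e for s, e in spans)
    ]


print(covered_unigrams(tokens, {"foobar"}))  # ['foo', 'bar']
print(covered_unigrams(tokens, set()))       # []
```

If the shingle matched, both unigrams are treated as satisfied; if it did not, they still have to match on their own.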

Hopefully this would mean that, for example, a search for "foo bar fak"
would return a result called "foobar fak" as well as "foo barfak", because
both cover 100% of the search words.
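
To spell out which forms would count as full coverage, the sketch below enumerates every way of joining adjacent search words, with at most two words per join to mirror max_shingle_size: 2. This is not the proposed filter attribute, only a hypothetical client-side helper (`join_variants`) that lists the candidate forms.

```python
def join_variants(words, max_join=2):
    """Every way to split `words` into runs of 1..max_join adjacent words,
    with each run joined into a single term."""
    if not words:
        return [[]]
    out = []
    for size in range(1, min(max_join, len(words)) + 1):
        head = "".join(words[:size])
        for rest in join_variants(words[size:], max_join):
            out.append([head] + rest)
    return out


for variant in join_variants(["foo", "bar", "fak"]):
    print(" ".join(variant))
# foo bar fak
# foo barfak
# foobar fak
```

Each printed line uses every search word exactly once, which is the sense of "100% coverage" above.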

If there's already a way to do something like this against the _all field
then I'm all ears.

Col

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

If it makes any difference I'm using the Tire gem for Ruby on Rails.

Col


There's also the unique token filter but I'm not sure that applying this to
my search analyser would filter out the search terms that became redundant
if a shingle was found in the index.

Col


Anyone?
