Partial match of sub-phrases to be scored higher?


(Ark) #1

Hello,

How can I model the query and/or mapping so that a partial match of a
sub-phrase gets a higher score than what an edgengram would return?

For example, if I have four documents:

  1. foo bar blah
  2. foo blah bar
  3. bar foo blah
  4. bar blah foo

If the search string is "bar bl", I would like documents 1 and 4 to be
scored higher than documents 2 and 3.

If the field is indexed using edgengram, all 4 documents would match (which
is fine for my use-case) but I think the scoring cannot yield the result I
am looking for.

There is also a "match_phrase_prefix" but that would match only #4.

Thanks
Ark

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

Hi,

You could use the edgeNGram filter on top of the shingle[1] filter (with
output_unigrams=false). This would allow you to boost on prefixes and
positions at the same time.
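
To make the idea concrete, here is a minimal pure-Python sketch that only simulates the analysis chain; it does not call Elasticsearch. The helper names (shingles, edge_ngrams, index_terms, matching_docs) are hypothetical. It assumes two-word shingles, lowercased whitespace tokenization at index time, and a search string that is kept as a single lowercased term at search time.

```python
# Simulation of "shingle (no unigrams) -> edge n-grams" on the four sample docs.
# All function names here are illustrative, not part of any Elasticsearch API.

DOCS = {
    1: "foo bar blah",
    2: "foo blah bar",
    3: "bar foo blah",
    4: "bar blah foo",
}


def shingles(tokens, size=2):
    """Word n-grams of exactly `size` tokens (no single-word output)."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def edge_ngrams(term, min_gram=1, max_gram=20):
    """Prefixes of `term`, from min_gram up to max_gram characters."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]


def index_terms(text):
    """Terms this simulated analyzer would emit for one document."""
    tokens = text.lower().split()
    terms = set()
    for shingle in shingles(tokens):
        terms.update(edge_ngrams(shingle))
    return terms


def matching_docs(query):
    """Ids of documents whose simulated terms contain the whole query string."""
    q = query.lower()
    return sorted(doc_id for doc_id, text in DOCS.items() if q in index_terms(text))


print(matching_docs("bar bl"))
```

Under these assumptions only documents 1 and 4 produce the term "bar bl". A clause on such a field could therefore be used to boost them, while a plain edgengram field keeps matching all four documents.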

The fact that you are interested in prefix matches makes me wonder whether
you are trying to implement auto-completion. If so, a better option could be
the completion suggester[2] (which is much faster than any index-based
solution), using all suffixes of your text as inputs. For example, the
"foo bar blah" suggestion could be indexed with
"input": ["foo bar blah", "bar blah", "blah"]. If you are not trying to
implement auto-completion, you can safely ignore this comment. :)
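
As a rough sketch of how those suffix inputs could be generated before indexing: the helper name suffix_inputs and the field name "suggest" below are assumptions made for illustration, and the text is assumed to be split on whitespace.

```python
def suffix_inputs(text):
    """Return every word-level suffix of `text`, longest first."""
    tokens = text.split()
    return [" ".join(tokens[i:]) for i in range(len(tokens))]


# Hypothetical document body; "suggest" is an assumed name for the completion field.
doc = {"suggest": {"input": suffix_inputs("foo bar blah")}}
print(doc)
```

With this, typing "bar bl" can complete to the "foo bar blah" entry through its "bar blah" input.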

[1] http://www.elasticsearch.org/guide/reference/index-modules/analysis/shingle-tokenfilter/
[2] http://www.elasticsearch.org/guide/reference/api/search/completion-suggest/

--
Adrien Grand



(Ark) #3

Thank you! I am not sure whether my case counts as auto-complete, since I am
only now learning some of the concepts and looking at what is possible. That
said, your suggestion sounds interesting and may well be a good fit.

Ark


