Which is the best (right) use of NGrams?

Hello,

I have been reading posts in this group, and there seem to be two schools of
thought on ngram use:

  1. index with an ngram-enabled analyzer but search with an analyzer without
    ngrams, so that complete search terms are matched against ngrams
  2. index with ngrams and search with ngrams

My understanding is:

#1 will require very long ngrams. There will be very few (one?) term
matches per document, and the longer/rarer the matched ngram, the better
the match. It essentially generates tons of "synonyms" (ngrams) for
your searched field and matches your terms against them. One problem is
that the max ngram length should essentially be no shorter than the longest
word. That seems to be an issue: while a handful of characters is often
enough to identify the document (think auto-complete scenario), a search
token longer than the max ngram length will return no hits.
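
To make the max-length problem in #1 concrete, here is a minimal sketch in plain Python (not Elasticsearch itself). The contract number and the min/max gram bounds are made up for illustration; the point is only that an un-ngrammed query must equal one of the indexed ngrams exactly.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is in [min_gram, max_gram]."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }

# Hypothetical contract number and ngram bounds, chosen only for illustration.
indexed_terms = ngrams("ab12345678", min_gram=2, max_gram=5)

def matches(query):
    # Approach #1: the query is NOT ngrammed, so it must equal one indexed term.
    return query in indexed_terms

print(matches("1234"))    # 4 characters, within max_gram
print(matches("123456"))  # 6 characters, longer than max_gram
```

The second query is a genuine substring of the identifier but still misses, because no indexed ngram is that long.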

#2 will need short ngrams, 3-5 characters at most, and will match the
n-grammed search term against the n-grammed field in the index. The more
matches, the better the score. The precision is probably not as good as #1,
so it would need to be combined with a search on the original field and maybe
a shingled field. But it will potentially handle simple typos.
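
A rough way to picture #2 is to ngram both sides and count the shared grams. The sketch below uses plain Python with fixed 3-character grams and made-up words; real scoring in the search engine is more involved than a raw overlap count, so treat this only as an illustration of the idea.

```python
def ngrams(text, n=3):
    """Return the set of n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_score(query, field_value, n=3):
    # Approach #2: ngram both sides and count the shared ngrams.
    return len(ngrams(query, n) & ngrams(field_value, n))

# Made-up words; "elastcsearch" is a typo that drops one letter.
print(overlap_score("elasticsearch", "elasticsearch"))  # exact match
print(overlap_score("elastcsearch", "elasticsearch"))   # typo still overlaps heavily
print(overlap_score("elastcsearch", "research"))        # unrelated word also overlaps
```

The typo keeps most of its overlap with the intended word, which is the typo tolerance mentioned above, but an unrelated word that shares a few grams also gets a non-zero score, which is where the precision concern comes from.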

I have two use cases (both to be used in auto-complete pick lists):

  1. A long identifier (contract number), 10-30 characters, which needs to be
    searchable on any part of it
  2. A company name, which needs to be searchable on individual words from the
    start of the words (could use a phrase prefix query or edgeNgram)

Could you please share your opinion about #1 and #2 (and any other
techniques you have used) and their applicability to my cases?

Thank you,
Alex

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The general approach is to index ngrams in a separate field and then craft
a query that searches on both fields but boosts matches on the non-ngram
field. This way you match on partial words (ngrams) but favor matches on
whole tokens. This is generally where DisMax is useful because the query
plays an important role in fine-tuning the relevance.
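
As a sketch of what such a query body might look like, the Python below builds a dis_max request as a plain dict. The field names ("name", "name.ngram"), the boost of 5.0 and the tie_breaker of 0.3 are assumptions picked for illustration, not values from this thread; they would need tuning against real data.

```python
import json

def build_autocomplete_query(text, whole_field="name", ngram_field="name.ngram",
                             whole_boost=5.0):
    """Build a dis_max body that searches both fields and favors whole tokens."""
    return {
        "query": {
            "dis_max": {
                "queries": [
                    # Hypothetical non-ngram field, boosted so whole-token hits win.
                    {"match": {whole_field: {"query": text, "boost": whole_boost}}},
                    # Hypothetical ngram field, catches partial-word matches.
                    {"match": {ngram_field: {"query": text}}},
                ],
                "tie_breaker": 0.3,
            }
        }
    }

print(json.dumps(build_autocomplete_query("acme"), indent=2))
```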

-Eric

Thank you, Eric. I understand that, but you can use them in two ways, as per
my post.

For autocomplete I typically use:

  • whitespace tokenizer
  • word delimiter token filter
  • edge-ngram token filter
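
The three components above could be declared as an index-time analyzer roughly like this, written here as a Python dict that would be sent as index settings. The names "autocomplete_index" and "autocomplete_edge" and the min_gram/max_gram values are my own placeholders, and the filter type is spelled "edge_ngram" in recent Elasticsearch releases ("edgeNGram" in older ones), so check it against your version.

```python
import json

# Hypothetical names: "autocomplete_index", "autocomplete_edge".
# min_gram/max_gram are illustrative guesses, not values from the thread.
analysis_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_edge": {
                    "type": "edge_ngram",  # "edgeNGram" in older releases
                    "min_gram": 1,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                "autocomplete_index": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["word_delimiter", "autocomplete_edge"],
                }
            },
        }
    }
}

print(json.dumps(analysis_settings, indent=2))
```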

At query time, I do not perform the edge-ngrams. This approach will
work for your 2nd use-case, but your first use-case is kind of tricky.
I would index that field twice, the first field would use:

  • keyword tokenizer
  • edge ngram

The 2nd field would use:

  • keyword tokenizer
  • reverse token filter
  • edge ngram

Again, skip the edge-ngrams at query time. This will allow prefix
matching and suffix matching on your contract number. For a contract
number of 12345, you will get it as a suggestion for queries of 12 or
345.
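
Here is a small plain-Python simulation of the two fields, just to show which queries they can answer. It assumes the query against the reversed field is itself reversed (but not edge-ngrammed) at search time; that detail is my assumption about the search analyzer.

```python
def edge_ngrams(text, min_gram=1):
    """Return all prefixes of text that are at least min_gram characters long."""
    return {text[:n] for n in range(min_gram, len(text) + 1)}

def index_contract(number):
    # Field 1: keyword tokenizer + edge ngram -> prefixes of the whole value.
    forward = edge_ngrams(number)
    # Field 2: keyword tokenizer + reverse + edge ngram -> prefixes of the reversed value.
    reverse = edge_ngrams(number[::-1])
    return forward, reverse

def suggests(query, number):
    forward, reverse = index_contract(number)
    # No edge ngrams at query time; the query for the reversed field is still
    # reversed (an assumption about the search analyzer, not stated in the thread).
    return query in forward or query[::-1] in reverse

print(suggests("12", "12345"))   # prefix
print(suggests("345", "12345"))  # suffix
print(suggests("234", "12345"))  # middle of the number
```

Note that a query from the middle of the number, such as 234, matches neither field, so this covers prefix and suffix lookups rather than search on truly any part of the identifier.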

Hope this helps.

Thanks,
Matt Weber

Doing ngram analysis on the query side will usually introduce a lot of
noise (i.e., relevance is bad).

The problem with auto-suggest is that it's hard to get relevance tuned just
right because you're usually matching against very small text fragments. At
the same time, relevance is really subjective making it hard to measure
with any real accuracy. Doing ngram analysis on the query side exacerbates
the problem in my experience. With that said, use cases differ as does the
quality of the data driving the auto-suggest and that can cause your mileage
to vary.

If I had doubts I'd just test both cases against my actual data and
requirements. That'll provide a more definitive answer.
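
A throwaway comparison along those lines could look like the sketch below. It is plain Python rather than a real index, the contract numbers are invented, and "any shared ngram is a hit" is a simplification of OR-style matching, so it only hints at how the two candidate sets can differ in size.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is in [min_gram, max_gram]."""
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

def search_whole_query(query, docs, min_gram=2, max_gram=10):
    # Index-time ngrams only: the raw query must equal an indexed ngram.
    return [d for d in docs if query in ngrams(d, min_gram, max_gram)]

def search_ngrammed_query(query, docs, n=3):
    # Ngrams on both sides with OR semantics: any shared ngram is a hit.
    q = ngrams(query, n, n)
    return [d for d in docs if q & ngrams(d, n, n)]

# Made-up contract numbers; replace with real data and real queries.
docs = ["AB-2013-0042", "AB-2013-0077", "CD-2012-0042", "XY-2011-9001"]
for query in ["2013", "0042"]:
    print(query,
          len(search_whole_query(query, docs)),
          len(search_ngrammed_query(query, docs)))
```

With this sample, the query 2013 returns two candidates when only the index is ngrammed but all four when the query is ngrammed too, since 2012 and 2011 share the gram 201.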

-Eric

Thanks Matt!

That is what I was going to do before I found an older thread about using
short ngrams at both indexing and searching. I was curious whether it would
work well. Ed's experience (his post in this thread) is that it does not work
very well. I am going to try it just to get some first-hand experience, but I
am pretty sure the approach you outlined is the one I will go with.

One question I have is whether you index forward and reverse edge ngrams
into the same field or two separate ones, particularly as it relates to
highlighting. Will highlighting work if I index both into the same field?

Thanks

Alex

Thanks Ed. I suspected as much. But as you suggested, I will do a
quick test. Maybe the nature of the data (the beginning of the alphanumeric
contract number is usually a good deal less unique than the end) will make it
work well.

It will need to be two fields, one normal, one reverse. You are going
to need to experiment with highlighting... I have a feeling that is
going to give you some mixed results. BTW, the other poster is Eric,
not Ed. :)

Hi there,

Interesting, I was experimenting with a very similar use case (search
suggestions on a [possibly] short list of one-to-few-word codes) with
highlighting. It seems to be working fine and I can share more details if
you are interested (though I would like to check a couple of details first to
make sure it is not buggy). My only concern is that my approach would not
scale well to large data (I am not using edgeNGrams but nGrams).
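
For a rough sense of the scaling concern, the token counts can be worked out directly. The sketch below counts tokens emitted per input token (not distinct terms in the index), and the 30-character length with grams from 1 to 30 is just an illustrative worst case borrowed from the contract-number example.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of ngram tokens emitted for one token of the given length."""
    return sum(max(length - n + 1, 0) for n in range(min_gram, max_gram + 1))

def edge_ngram_count(length, min_gram, max_gram):
    """Number of edge-ngram tokens (prefixes) emitted for the same token."""
    return max(min(length, max_gram) - min_gram + 1, 0)

# A 30-character identifier with grams from 1 to 30 characters (illustrative bounds).
print(ngram_count(30, 1, 30))       # 30 + 29 + ... + 1 = 465
print(edge_ngram_count(30, 1, 30))  # one prefix per length = 30
```

Full ngrams grow roughly with the square of the token length, while edge ngrams grow linearly, which is the trade-off between matching any part of the value and keeping the index small.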

Regards,
Lukas

Sorry Eric :( I was talking to Ed at the time and got the names confused...

No worries, I've been called much worse :wink:

-Eric

On Wednesday, February 20, 2013 6:36:58 PM UTC-5, AlexR wrote:

Sorry, Eric :frowning: I was talking to Ed at the time and got the names confused.

On Wed, Feb 20, 2013 at 5:56 PM, AlexR <royt...@gmail.com> wrote:

Thanks, Ed. I suspected as much, but as you suggested I will do a
quick test. Maybe the nature of the data (the beginning of the alphanumeric
contract number is usually a good deal less unique than the end) will make it
work well.

On Wednesday, February 20, 2013 2:25:11 PM UTC-5, egaumer wrote:

Doing ngram analysis on the query side will usually introduce a lot of
noise (i.e., relevance is bad).

The problem with auto-suggest is that it's hard to get relevance tuned
just right because you're usually matching against very small text
fragments. At the same time, relevance is really subjective, making it hard
to measure with any real accuracy. Doing ngram analysis on the query side
exacerbates the problem in my experience. That said, use cases differ,
as does the quality of the data driving the auto-suggest, so your mileage
may vary.

If I had doubts I'd just test both cases against my actual data and
requirements. That'll provide a more definitive answer.

-Eric

On Wednesday, February 20, 2013 1:50:48 PM UTC-5, AlexR wrote:

Thank you, Eric. I understand that, but you can use them in two ways, as
per my post.

Hi Lukas,

It will be very interesting to compare notes. I will be out of town for a few days and may not be able to finish my tests, so let's touch base next week if that's OK with you.

Alex


Matt,

My understanding is that prefix and suffix edge ngrams only handle
searches on a prefix or suffix, not on an arbitrary internal substring of my
contract number. I think I have to go with short ngrams at index and search
time and use a match query with "and" to ensure an (almost) precise match.

BTW, for suffix ngrams I do not think there is a need for the reverse filter,
since an edge ngram can be applied from the end of the word. As I understand
it, reversing (at index and search time) is only a way to get back-aligned
ngrams, and since those are supported directly there is no need for it.
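
To make the prefix/suffix limitation concrete, here is a minimal Python sketch that only imitates the token sets (it does not call Elasticsearch). The helper names, the gram sizes, and the sample identifier are assumptions chosen for illustration.

```python
def edge_ngrams(text, min_gram, max_gram, side="front"):
    """Return edge n-grams anchored at the front or the back of text."""
    upper = min(max_gram, len(text))
    if side == "front":
        return [text[:n] for n in range(min_gram, upper + 1)]
    return [text[-n:] for n in range(min_gram, upper + 1)]


def ngrams(text, size):
    """Return every substring of exactly `size` characters."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


# Assumed sample identifier, lowercased as a lowercase filter would do.
contract = "hhsn272ncu31551"

front = set(edge_ngrams(contract, 4, 40, side="front"))
back = set(edge_ngrams(contract, 4, 40, side="back"))

prefix_hit = "hhsn272" in front           # True: a prefix of the identifier
suffix_hit = "31551" in back              # True: a suffix of the identifier
internal_hit = "272ncu" in front | back   # False: an internal substring

# Short n-grams at index and search time, combined with "and":
indexed = set(ngrams(contract, 3))
query_grams = set(ngrams("272ncu", 3))
ngram_and_hit = query_grams <= indexed    # True: every query gram is indexed

print(prefix_hit, suffix_hit, internal_hit, ngram_and_hit)
```

The internal substring is found only by the short-ngram field, which is why the edge-ngram fields alone do not cover the "any part" requirement.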

On Wednesday, February 20, 2013 2:17:52 PM UTC-5, Matt Weber wrote:

For autocomplete I typically use:

  • whitespace tokenizer
  • word delimiter token filter
  • edge-ngram token filter

At query time, I do not perform the edge-ngrams. This approach will
work for your 2nd use-case, but your first use-case is kind of tricky.
I would index that field twice, the first field would use:

  • keyword tokenizer
  • edge ngram

The 2nd field would use:

  • keyword tokenizer
  • reverse token filter
  • edge ngram

Again, skip the edge-ngrams at query time. This will allow prefix
matching and suffix matching on your contract number. For a contract
number of 12345, you will get it as a suggestion for queries of 12 or
345.
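
The two-field setup above can be imitated in plain Python as a rough sketch. The function names and gram limits are hypothetical, and the sketch assumes that the query for the second field still goes through the reverse filter (only the edge-ngram step is skipped at query time).

```python
def edge_ngrams_front(token, min_gram=1, max_gram=20):
    """Front-anchored edge n-grams of a single token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_prefix_field(value):
    # keyword tokenizer -> edge ngram
    return set(edge_ngrams_front(value))


def index_suffix_field(value):
    # keyword tokenizer -> reverse token filter -> edge ngram
    return set(edge_ngrams_front(value[::-1]))


def suggest(query, value):
    """True if the query matches either field.

    The query is not n-grammed; for the suffix field it is only reversed
    (an assumption about the search-side analyzer).
    """
    return (query in index_prefix_field(value)
            or query[::-1] in index_suffix_field(value))


print(suggest("12", "12345"))    # prefix match
print(suggest("345", "12345"))   # suffix match via the reversed field
print(suggest("234", "12345"))   # internal substring, not matched
```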

Hope this helps.

Thanks,
Matt Weber


Hi Lukas,

I did a bit of testing and I can see several approaches for autocomplete-style
search on ANY part of a long identifier string (i.e. contract number):

  1. Index and search with short ngrams, using a match query with the
    "and" operator
  2. Search by prefix or suffix only (not any part): index twice, with front
    and back edge ngrams, and search with a term query on un-ngrammed criteria
  3. Index with long ngrams (say from 3 to 40 characters in my case), longer
    than the maximum length of an indexed contract number, and search with
    un-ngrammed criteria

#1 works fine and highlights well. There may be some false hits, but it
should be pretty accurate in my case of searching contract numbers.

#2 works fine but is limited to prefix/suffix searches.

#3 searches fine but highlighting is very erratic: sometimes it highlights
the hits and sometimes it does not. It looks like a bug to me, unless I am
missing something.

Another option is to index back (reverse) edge ngrams and run a prefix query
against them. I have not tried it, but it should probably work well; I am not
sure about highlighting, though.
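
As a rough way to compare how many terms each approach puts in the index per identifier, here is a small Python sketch. It only counts tokens; the 20-character length and the minimum gram size of 3 for the edge-ngram fields are assumptions, not measured values.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of n-grams produced from one token of `length` characters."""
    upper = min(max_gram, length)
    return sum(length - n + 1 for n in range(min_gram, upper + 1))


def edge_ngram_count(length, min_gram, max_gram):
    """Number of edge n-grams produced from one side of a token."""
    return max(0, min(max_gram, length) - min_gram + 1)


length = 20  # assumed contract number length

short_ngrams = ngram_count(length, 3, 3)             # approach 1
edge_fields = 2 * edge_ngram_count(length, 3, 40)    # approach 2, two fields
long_ngrams = ngram_count(length, 3, 40)             # approach 3

print(short_ngrams, edge_fields, long_ngrams)  # 18 36 171
```

Under these assumptions the long-ngram field indexes roughly ten times as many terms per value as the short-ngram field, which is worth keeping in mind for large data sets.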

Would you share your findings?

Thank you,
Alex

On Wednesday, February 20, 2013 6:34:43 PM UTC-5, Lukáš Vlček wrote:

Hi there,

Interesting, I was experimenting with a very similar use case (search
suggestions on a [possibly] short list of one-to-few-word codes) with
highlighting. It seems to be working fine and I can share more details if
you are interested (though I would like to check a couple of details first to
make sure it is not buggy). My only concern is that my approach would not
scale well for large data (I am using nGrams, not edgeNGrams).

Regards,
Lukas


I tested searching using A) match/and with short (3-char) ngrams (at index
and search time) vs. B) reverse (back-aligned) edge ngrams with a prefix
query.
My searches run against strings (contract numbers) of about 20-30 characters.
When the search string is short (4-6 chars), A runs about 2-3 times faster
than B. When I use a 15-20 character search string, A and B run at
about the same speed.


Can you post a gist of a sample mapping and a sample query?


Here it is

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "doc_key_edge_ngram_back": {
          "type": "edgeNGram",
          "min_gram": 4,
          "max_gram": 40,
          "side": "back"
        },
        "doc_key_ngram_short": {
          "min_gram": 4,
          "max_gram": 4,
          "type": "nGram"
        }
      },
      "analyzer": {
        "doc_key_partial": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "doc_key_edge_ngram_back"
          ],
          "type": "custom"
        },
        "doc_key_partial_short_ngram": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "doc_key_ngram_short"
          ],
          "type": "custom"
        },
        "doc_key": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {

...

    "piid": {
      "type": "multi_field",
      "fields": {
        "piid": {
          "type": "string",
          "analyzer": "doc_key"
        },
        "partial": {
          "type": "string",
          "search_analyzer": "doc_key",
          "index_analyzer": "doc_key_partial",
          "include_in_all": false
        },
        "partial_sng": {
          "type": "string",
          "analyzer": "doc_key_partial_short_ngram",
          "include_in_all": false
        }
      }
    }
...
  }
}

/* Here is what the data looks like:

HHSN272NCU31551
HHSN263200000052112B
*/

// Query using short ngrams (not 100% precise: it can return false positives
// on overlapping grams...)

{
  "match": {
    "piid.partial_sng": {
      "query": query,
      "operator": "and"
    }
  }
}

// Query using prefix on reverse edge ngrams

{
  "prefix": {
    "award.piid.partial": query
  }
}
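
The false positives mentioned for the short-ngram query can be reproduced with a small Python sketch that imitates 4-gram analysis and the "and" operator (it does not call Elasticsearch). The identifier `ABCDE-BCDEF` is made up for illustration, and `match_and_verified` is a hypothetical client-side post-filter, not something taken from the mapping above.

```python
def ngrams(text, size=4):
    """Lowercase the text and return all n-grams of the given size."""
    text = text.lower()
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def match_and(query, indexed_value, size=4):
    """Imitate match with operator "and": every query gram must be indexed."""
    query_grams = ngrams(query, size)
    if not query_grams:
        # A query shorter than the gram size produces no terms in this sketch.
        return False
    return set(query_grams) <= set(ngrams(indexed_value, size))


def match_and_verified(query, indexed_value, size=4):
    """Hypothetical post-filter: keep a hit only if it is a real substring."""
    return (match_and(query, indexed_value, size)
            and query.lower() in indexed_value.lower())


doc = "ABCDE-BCDEF"  # made-up identifier

false_positive = match_and("ABCDEF", doc)        # True: abcd, bcde, cdef all indexed
is_substring = "abcdef" in doc.lower()           # False: the grams are not contiguous
verified = match_and_verified("ABCDEF", doc)     # False after the substring check
real_hit = match_and_verified("NCU3", "HHSN272NCU31551")  # True

print(false_positive, is_substring, verified, real_hit)
```

The grams `abcd`, `bcde`, and `cdef` all occur in the indexed value, but not next to each other, so the "and" match succeeds even though the query string is not actually contained in the identifier.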


Interesting. I will play around with it and will post back if I can find
a (fast) way to get rid of the false positives! Thanks for sharing.

Which one did you end up using? Or are you still in the research phase?


Still prototyping. For now I use a prefix query on back-aligned edge ngrams. I will experiment some more. I wonder how match/phrase works against short ngrams compared with match/and.
Please share your findings.
