Hyphen search


(such_mensch) #1

Which tokenizers work best for searching words that contain hyphens?
Example: a search for "test" among the phrases {"test", "test-2",
"3-test"} should return all three, not just "test".


(simonw-2) #2

I'd likely use the word-delimiter token filter and preserve the original
token. This is probably the easiest and most effective solution for what
you want to do, and it's independent of tokenization. See the
documentation: http://www.elasticsearch.org/guide/reference/index-modules/analysis/word-delimiter-tokenfilter.html

As a start, I'd use generate_word_parts = true, catenate_words = true,
preserve_original = true.
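
A minimal sketch of index settings along those lines, written as a Python
dict that can be serialized to JSON. The filter name "hyphen_delimiter" and
the analyzer name "hyphen_analyzer" are made up for illustration, and the
choice of the whitespace tokenizer is an assumption (so that a token such as
"test-2" reaches the filter in one piece), not something stated above.

```python
import json

# Hypothetical names: "hyphen_delimiter" and "hyphen_analyzer".
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "hyphen_delimiter": {
                    "type": "word_delimiter",
                    # The three options suggested in the post above.
                    "generate_word_parts": True,
                    "catenate_words": True,
                    "preserve_original": True,
                }
            },
            "analyzer": {
                "hyphen_analyzer": {
                    "type": "custom",
                    # Assumption: split on whitespace only, so hyphenated
                    # tokens are handed to the filter intact.
                    "tokenizer": "whitespace",
                    "filter": ["hyphen_delimiter"],
                }
            },
        }
    }
}

# Body to send when creating the index.
print(json.dumps(index_settings, indent=2))
```

The analyzer would still have to be assigned to the relevant field in the
mapping, and existing documents would need to be reindexed.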

simon

On Friday, July 20, 2012 4:11:10 PM UTC+2, such_mensch wrote:

Which tokenizers work best for searching words that contain hyphens?
Example: a search for "test" among the phrases {"test", "test-2",
"3-test"} should return all three, not just "test".


(John Ohno) #3

You can get away without modifying the tokenization procedure by dropping
into Lucene query syntax and translating the query into '*test*', but that
makes search time linear with respect to shard size. If you retokenized,
you'd have to use proximity to find the whole phrase, which could quite
possibly be slower.


(simonw-2) #4

On Friday, July 20, 2012 6:48:57 PM UTC+2, johno wrote:

You can get away without modifying the tokenization procedure by dropping
into Lucene query syntax and translating the query into '*test*', but that
makes search time linear with respect to shard size. If you retokenized,
you'd have to use proximity to find the whole phrase, which could quite
possibly be slower.

I would absolutely not recommend doing this! Leading wildcards are (1)
very expensive when the query is "rewritten" and (2) possibly very
expensive when the search executes. The linear factor you are talking
about is the number of unique terms in your dictionary. Just building the
query can easily take multiple seconds, even on a small index. I wouldn't
even recommend a trailing wildcard unless you really, really need it. Such
a query can expand to hundreds of terms, and you are basically calculating
the disjunction of those at query time.
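
To illustrate the difference, here is a toy sketch that models the term
dictionary as a sorted Python list. It is not how Lucene is implemented
internally; the function names are made up, and it only shows why a
trailing wildcard can seek to its starting point while a leading wildcard
has to look at every unique term.

```python
from bisect import bisect_left

def prefix_terms(sorted_terms, prefix):
    """Trailing wildcard (prefix*): seek to the prefix, then read forward."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

def suffix_terms(sorted_terms, suffix):
    """Leading wildcard (*suffix): no seek is possible, so every unique
    term in the dictionary is checked."""
    return [term for term in sorted_terms if term.endswith(suffix)]

terms = sorted(["3-test", "contest", "test", "test-2", "tested", "toast"])

# Either way, the query becomes a disjunction over the expanded terms.
trailing = prefix_terms(terms, "test")
leading = suffix_terms(terms, "test")
print(trailing)
print(leading)
```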

Leading wildcards are disabled by default in the Lucene query parser.
ElasticSearch allows them by default, which is bad IMO. I'd definitely set
this to false via the settings:
"indices.query.query_string.allowLeadingWildcard" : false

This could become very dangerous if your users can enter leading wildcards
directly!

simon


(John Ohno) #5

Ideally, Lucene core would have a sensible wildcard implementation and
store reverse indexes, because linear scans for leading-wildcard searches
(or for any searches, for that matter) are unacceptable. But on the scale
of difficulty, performing a complete re-index with different tokenization
parameters trumps performing a single linear scan.

On Friday, July 20, 2012 1:42:04 PM UTC-4, simonw wrote:

I would absolutely not recommend doing this! Leading wildcards are (1)
very expensive when the query is "rewritten" and (2) possibly very
expensive when the search executes. The linear factor you are talking
about is the number of unique terms in your dictionary. Just building the
query can easily take multiple seconds, even on a small index. I wouldn't
even recommend a trailing wildcard unless you really, really need it. Such
a query can expand to hundreds of terms, and you are basically calculating
the disjunction of those at query time.

Leading wildcards are disabled by default in the Lucene query parser.
ElasticSearch allows them by default, which is bad IMO. I'd definitely set
this to false via the settings:
"indices.query.query_string.allowLeadingWildcard" : false

This could become very dangerous if your users can enter leading wildcards
directly!

simon


(simonw-2) #6

You can do this yourself with the ReverseStringFilter shipped with Lucene
core. You might need to be a little smarter about query creation, and
ideally this would be built into ES directly, but it's all there and
reasonable to do if you need it.
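
The idea behind the reversal trick can be sketched in a few lines of plain
Python. This is a toy model with made-up function names, not the Lucene
API: terms are stored reversed in a second sorted dictionary, so a leading
wildcard such as *test turns into a prefix lookup for "tset".

```python
from bisect import bisect_left

def build_reversed_dictionary(terms):
    """Store every term reversed, in sorted order (the extra index)."""
    return sorted(term[::-1] for term in terms)

def leading_wildcard(reversed_terms, suffix):
    """Answer *suffix as a prefix lookup on the reversed dictionary."""
    prefix = suffix[::-1]
    start = bisect_left(reversed_terms, prefix)
    matches = []
    for rterm in reversed_terms[start:]:
        if not rterm.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(rterm[::-1])  # un-reverse before returning
    return matches

reversed_terms = build_reversed_dictionary(["test", "test-2", "3-test", "toast"])
hits = leading_wildcard(reversed_terms, "test")
print(hits)
```

The cost is a second copy of the terms in the index, and the query side
has to know when to reverse the pattern, which is the "smarter query
creation" part.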

simon

On Friday, July 20, 2012 8:04:23 PM UTC+2, johno wrote:

Ideally, Lucene core would have a sensible wildcard implementation and
store reverse indexes, because linear scans for leading-wildcard searches
(or for any searches, for that matter) are unacceptable. But on the scale
of difficulty, performing a complete re-index with different tokenization
parameters trumps performing a single linear scan.


(such_mensch) #7

Thanks for the reply. Using the word-delimiter filter seems like the best
idea. That filter causes one problem, though: a phrase "te-st" might rank
better than "test", since the word-delimiter removes the hyphen. Any
suggestions?

On Friday, July 20, 2012 4:45:57 PM UTC+2, simonw wrote:

I'd likely use the word-delimiter token filter and preserve the original
token. This is probably the easiest and most effective solution for what
you want to do, and it's independent of tokenization. See the
documentation:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/word-delimiter-tokenfilter.html

As a start, I'd use generate_word_parts = true, catenate_words = true,
preserve_original = true.

simon


(Jörg Prante) #8

On Monday, July 23, 2012 11:18:34 AM UTC+2, such_mensch wrote:

Thanks for the reply. Using the word-delimiter filter seems like the best
idea. That filter causes one problem, though: a phrase "te-st" might rank
better than "test", since the word-delimiter removes the hyphen. Any
suggestions?

Omitting term freqs, positions, and norms may help.
http://www.elasticsearch.org/guide/reference/mapping/core-types.html
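
A rough sketch of what such a mapping might look like, as a Python dict.
The type name "doc" and field name "title" are hypothetical, and the option
names omit_norms and omit_term_freq_and_positions are an assumption based
on the core-types documentation of that era; newer versions may spell these
options differently, so check the reference for the version in use.

```python
import json

# Hypothetical type ("doc") and field ("title") names.
mapping = {
    "doc": {
        "properties": {
            "title": {
                "type": "string",
                # Assumed option names; verify against your version's docs.
                "omit_norms": True,
                "omit_term_freq_and_positions": True,
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Note that without positions, phrase and proximity queries on that field no
longer work, so this is a trade-off rather than a free fix.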

Best,

Jörg


(simonw-2) #9

Hey,

On Monday, July 23, 2012 2:28:28 PM UTC+2, Jörg Prante wrote:

On Monday, July 23, 2012 11:18:34 AM UTC+2, such_mensch wrote:

Thanks for the reply. Using the word-delimiter filter seems like the best
idea. That filter causes one problem, though: a phrase "te-st" might rank
better than "test", since the word-delimiter removes the hyphen. Any
suggestions?

Omitting term freqs, positions, and norms may help.
http://www.elasticsearch.org/guide/reference/mapping/core-types.html

I think he is referring to the fact that if you "translate" te-st to test
in the filter, it might get a better score, since you keep the original
and hit both terms in a document. That might be possible, but that seems
OK, no? I mean, if you have "a test document", you will get the same score
for both queries, "test" and "te-st".

simon


(such_mensch) #10

"test" should get a better score than "te-st" since it's an exact match.
Both "test" and "te-st" should be accepted as results, just not with
"te-st" as the best one.

On Tuesday, July 24, 2012 8:59:34 PM UTC+2, simonw wrote:

Hey,

I think he is referring to the fact that if you "translate" te-st to test
in the filter, it might get a better score, since you keep the original
and hit both terms in a document. That might be possible, but that seems
OK, no? I mean, if you have "a test document", you will get the same score
for both queries, "test" and "te-st".

simon


(simonw-2) #11

If you use WordDelimiter and preserve the original, you should get a
better score for an exact match, i.e. test vs. te-st, due to the coord
factor in the similarity, which measures the fraction of search terms that
match in a document. However, this factor depends on how you build the
query: only boolean scorers use it right now.

Did you try it out in an example? The score difference should be
measurable.
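
A simplified sketch of the coord idea, assuming the analyzed query ends up
as a plain boolean OR over its terms. The token lists below are
illustrative guesses at what the filter settings discussed earlier in the
thread might emit, not verified analyzer output, and the helper names are
made up. Real scores also include tf, idf and norms, which are left out
here.

```python
def coord(overlap, max_overlap):
    """Fraction of query terms that matched in the document."""
    return overlap / max_overlap

def coord_for(query_terms, doc_terms):
    """coord for one document, treating the query as an OR of its terms."""
    unique_query = set(query_terms)
    overlap = len(unique_query & set(doc_terms))
    return coord(overlap, len(unique_query))

# Hypothetical tokens for the query "te-st" (original, parts, catenation).
query_terms = ["te-st", "te", "st", "test"]

# Hypothetical indexed tokens for two one-word documents.
doc_with_hyphen = {"te-st", "te", "st", "test"}  # document text: "te-st"
doc_plain = {"test"}                             # document text: "test"

print(coord_for(query_terms, doc_with_hyphen))
print(coord_for(query_terms, doc_plain))
```

Under these assumptions the document that matches all of the query's terms
gets the full coord, while the one matching a single term gets a quarter of
it.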

simon

On Wednesday, July 25, 2012 9:17:07 AM UTC+2, such_mensch wrote:

"test" should get a better score than "te-st" since it's an exact match.
Both "test" and "te-st" should be accepted as results, just not with
"te-st" as the best one.

