Difficulties in searching in strings where words are separated by dots, underscores and hyphens

Hi,
we use Elasticsearch on our file-sharing website. Our aim is to deliver the
best search results we can. We have now implemented the Elasticsearch
Fuzzy Like This Query, but I am not convinced it is the best approach for us.
We have one search field for users. Keywords are searched in the names and
descriptions of uploaded files. The Fuzzy Like This Query gives us great
results when the keyword appears as a separate word in the description or
filename (example: the keyword is "trance", and it finds "trance 2014",
"Best trance Ever", etc.).
The difficulties are with filenames/descriptions where the words are
separated by dots (.), underscores (_) or hyphens (-) (example: the keyword
is "trance", and it does not find "Best.trance.Ever", "Best-trance-Ever" or
"Best_trance_Ever"). Do you have any advice for this?
Is it possible to solve it within the Fuzzy Like This Query?

Thank you very much for your help.

Best regards,
Radim Bukovsky

Settings (Fuzzy Like This Query):
addFields(array('name', 'description'));
setLikeText($file->name);
setMaxQueryTerms(12);
setMinSimilarity(1);

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9cdd5b1f-a728-44e5-9efb-c85251a00100%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hi,

This sounds like an analyzer issue to me.
Can you share more info about the analyzer configuration for your 'name' and
'description' fields?

Regards,
Lukas

Hi,
I am not sure I understand what you mean. Could you be more
specific? Many thanks.

Settings:

setFields(array('name^3', 'description'));          -> 'fields'
setQuery($search . ' extension:(' . $filter . ')'); -> 'query'
setDefaultOperator('AND');                          -> 'default_operator'

Here is the list of all parameters, if it helps:

'allow_leading_wildcard'       -> true
'enable_position_increments'   -> true
'lowercase_expanded_terms'     -> true
'fuzzy_prefix_length'          -> 0
'fuzzy_min_sim'                -> 0.5
'phrase_slop'                  -> 0
'boost'                        -> 1.0
'analyze_wildcard'             -> true
'auto_generate_phrase_queries' -> true
'use_dis_max'                  -> true
'tie_breaker'                  -> 0
'rewrite'                      -> ""
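
The parameter names above look like those of the query_string query, so the three explicit settings would presumably end up in a request body along these lines. This is only a sketch under that assumption: the function name build_search_body and the sample filter value are made up for illustration, and the remaining parameters are left at the defaults listed above.

```python
import json


def build_search_body(search, extension_filter):
    """Build a query_string request body from the user's search text.

    Assumption: the listed settings belong to a query_string query.
    The function and argument names are illustrative only.
    """
    # Mirrors setQuery($search . ' extension:(' . $filter . ')')
    query = f"{search} extension:({extension_filter})"
    return {
        "query": {
            "query_string": {
                "fields": ["name^3", "description"],  # 'fields'
                "query": query,                       # 'query'
                "default_operator": "AND",            # 'default_operator'
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical filter value, just to show the shape of the body.
    print(json.dumps(build_search_body("trance", "mp3 OR flac"), indent=2))
```

Note that the query only defines what is searched for; whether "trance" matches "Best.trance.Ever" depends on how the 'name' and 'description' fields were analyzed at index time.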

Regards,
Radim

Your issue is with the standard tokenizer, which tokenizes on most non-word
characters.

Try using a whitespace or pattern tokenizer, depending on your use case.
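
As a rough illustration of the pattern tokenizer option, here is a minimal sketch of index analysis settings that split on whitespace, dots, underscores and hyphens. The names filename_tokenizer and filename_analyzer and the separator pattern are assumptions chosen for this example, not something from the original setup. The small helper only approximates with Python's re module what such a tokenizer plus a lowercase filter would emit; the real tokenizer uses Java regular expressions.

```python
import json
import re

# Separator pattern (assumption): whitespace, dot, underscore, hyphen.
SEPARATORS = r"[\s._-]+"

# Sketch of analysis settings; the two names are hypothetical.
ANALYSIS_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "filename_tokenizer": {"type": "pattern", "pattern": SEPARATORS}
        },
        "analyzer": {
            "filename_analyzer": {
                "type": "custom",
                "tokenizer": "filename_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}


def split_on_separators(text):
    """Approximate the analyzer above: split on separators, lowercase."""
    return [token.lower() for token in re.split(SEPARATORS, text) if token]


if __name__ == "__main__":
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
    for name in ["Best.trance.Ever", "Best-trance-Ever", "Best_trance_Ever"]:
        print(name, "->", split_on_separators(name))
```

With an analyzer like this on the 'name' and 'description' fields, all three example filenames would be indexed as the separate terms best, trance and ever. Existing documents would need to be reindexed after the mapping change.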

--
Ivan

If you still need the standard analyzer's behavior for words but want to
force separation on anything containing dots and underscores, you can use the
mapping character filter to convert those characters to spaces. It's pretty
crude but it should work. You can also use the word delimiter token filter,
but it has more weird corner cases.
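
To make the character filter idea concrete, here is a minimal sketch of analysis settings that map dots, underscores and hyphens to spaces before the standard tokenizer runs. The names separators_to_spaces and filename_analyzer are made up for this example, and writing the space as \u0020 in the mappings is an assumption about how the filter parses its rules. The helper is only a plain-Python approximation of the result, not the real standard tokenizer.

```python
import json

# Sketch of analysis settings; both names are hypothetical.
# Assumption: the replacement space is written as the escape \u0020.
ANALYSIS_SETTINGS = {
    "analysis": {
        "char_filter": {
            "separators_to_spaces": {
                "type": "mapping",
                "mappings": [". => \\u0020", "_ => \\u0020", "- => \\u0020"],
            }
        },
        "analyzer": {
            "filename_analyzer": {
                "type": "custom",
                "char_filter": ["separators_to_spaces"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

# Same replacement done locally, to preview the effect on sample text.
SEPARATOR_TABLE = str.maketrans({".": " ", "_": " ", "-": " "})


def map_then_split(text):
    """Replace separators with spaces, lowercase, split on whitespace."""
    return text.translate(SEPARATOR_TABLE).lower().split()


if __name__ == "__main__":
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
    for name in ["Best.trance.Ever", "Best_trance_Ever", "setup.exe"]:
        print(name, "->", map_then_split(name))
```

The last sample shows why this is crude: every dot is replaced, so file extensions and version numbers such as "1.2" are split apart as well.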

Nik

To customize the default word boundary properties of Lucene's
StandardTokenizer, I created an Elasticsearch plugin.

As mentioned before, there are other tokenizers and filters that can be used,
but each one has its own drawbacks. Plus, you'd forgo the benefits of
the StandardTokenizer (e.g. handling invisible characters that aren't
considered whitespace, and being able to define conditional word break rules
for punctuation symbols depending on how they appear in the text).

Hope this helps

Bryan
