Difficulties in searching in strings where words are separated by dots, underscores and hyphens

Radim_Bukovsky · October 7, 2014, 7:28am

Hi,
we use Elasticsearch on our file-sharing website. Our aim is to deliver the
best search results we can. We have implemented the Elasticsearch Fuzzy
Like This Query, but I am not convinced it is the best approach for us.
We have one search field for users. Keywords are searched in the names and
descriptions of uploaded files. The Fuzzy Like This Query gives us great
results when the words in the description or filename are separated by
spaces (example: keyword is "trance", it finds "trance 2014",
"Best trance Ever" etc.).
The difficulties are with filenames/descriptions where the keywords are
separated by dots (.), underscores (_) or hyphens (-) (example: keyword is
"trance", it does not find "Best.trance.Ever" or "Best-trance-Ever" or
"Best_trance_Ever"). Do you have any advice for this?
Is it possible to solve it within the Fuzzy Like This Query?

Thank you very much for your help.

Best regards,
Radim Bukovsky

Settings (Fuzzy Like This Query):
addFields(array('name', 'description'));
setLikeText($file->name);
setMaxQueryTerms(12);
setMinSimilarity(1);

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9cdd5b1f-a728-44e5-9efb-c85251a00100%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lukas_Vlcek1 · October 7, 2014, 7:41am

Hi,

this sounds like an issue with the analyzer to me.
Can you share more info about the analyzer configuration for your 'name'
and 'description' fields?

Regards,
Lukas

Radim_Bukovsky · October 7, 2014, 2:24pm

Hi,
I am not sure I understand what you mean. Could you be more specific?
Many thanks.

Settings:

setFields(array('name^3', 'description'));        -> ('fields')
setQuery($search . ' extension:('.$filter.')');   -> ('query')
setDefaultOperator('AND');                        -> ('default_operator')

Here is the list of all parameters, if it helps:

'allow_leading_wildcard'        -> true
'enable_position_increments'    -> true
'lowercase_expanded_terms'      -> true
'fuzzy_prefix_length'           -> 0
'fuzzy_min_sim'                 -> 0.5
'phrase_slop'                   -> 0
'boost'                         -> 1.0
'analyze_wildcard'              -> true
'auto_generate_phrase_queries'  -> true
'use_dis_max'                   -> true
'tie_breaker'                   -> 0
'rewrite'                       -> ""

Regards,
Radim

Ivan · October 7, 2014, 4:55pm

Your issue is with the standard tokenizer, which will tokenize on most
non-word characters.

Try using a whitespace or pattern tokenizer, depending on your use case.
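
A minimal sketch of the pattern tokenizer suggestion, written in Python for illustration only. The names `filename_tokenizer` and `filename_analyzer` and the separator pattern are assumptions, not something taken from this thread, and the `tokenize` helper merely approximates what such an analyzer would emit; it is not Elasticsearch itself.

```python
import json
import re

# Assumed separator set: whitespace, dot, underscore, hyphen.
SEPARATORS = r"[\s._-]+"

# Index settings for a custom analyzer built on the pattern tokenizer.
# "filename_tokenizer" and "filename_analyzer" are hypothetical names.
INDEX_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "filename_tokenizer": {"type": "pattern", "pattern": SEPARATORS}
        },
        "analyzer": {
            "filename_analyzer": {
                "type": "custom",
                "tokenizer": "filename_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}


def tokenize(text):
    """Approximate the analyzer locally: split on separators, lowercase."""
    return [t.lower() for t in re.split(SEPARATORS, text) if t]


if __name__ == "__main__":
    print(json.dumps(INDEX_SETTINGS, indent=2))
    for name in ("Best.trance.Ever", "Best-trance-Ever", "Best_trance_Ever"):
        print(name, "->", tokenize(name))
```

An analyzer like this would still have to be assigned to the 'name' and 'description' fields in the mapping, and existing documents reindexed, before searches see the new tokens.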

--
Ivan

nik9000 · October 7, 2014, 6:36pm

If you still need the standard analyzer's behavior for words but want to
force separation on stuff containing dots and underscores, you can use the
mapping character filter to convert those characters to spaces. It's pretty
crude but it should work. You can also use the word delimiter token filter,
but it has more weird corner cases.
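
A rough sketch of the character filter approach, in Python for illustration. The names `separators_to_space` and `filename_standard` are hypothetical, and the choice of characters to map is an assumption. The `\u0020` escape is used for the replacement on the assumption that a bare space in a mapping rule would be trimmed. The helper below imitates only the character filter stage; splitting on whitespace afterwards is a stand-in for the standard tokenizer, not an exact model of it.

```python
import json

# Characters to turn into spaces before tokenization (assumed set).
SEPARATORS = "._-"

# "separators_to_space" and "filename_standard" are hypothetical names.
# Each rule reads e.g. ". => \u0020"; the escape stands for a space
# (assumed necessary because whitespace around a rule is trimmed).
INDEX_SETTINGS = {
    "analysis": {
        "char_filter": {
            "separators_to_space": {
                "type": "mapping",
                "mappings": [c + " => \\u0020" for c in SEPARATORS],
            }
        },
        "analyzer": {
            "filename_standard": {
                "type": "custom",
                "char_filter": ["separators_to_space"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}


def apply_char_filter(text):
    """Imitate the char filter stage only: map each separator to a space."""
    return text.translate(str.maketrans(SEPARATORS, " " * len(SEPARATORS)))


if __name__ == "__main__":
    print(json.dumps(INDEX_SETTINGS, indent=2))
    for name in ("Best.trance.Ever", "Best-trance-Ever", "Best_trance_Ever"):
        filtered = apply_char_filter(name)
        # Whitespace split + lowercase as a crude stand-in for the
        # standard tokenizer and lowercase filter.
        print(name, "->", filtered.lower().split())
```

Because the characters are rewritten before tokenization, the rest of the standard analysis chain stays as it is.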

Nik

Bryan_Warner · October 8, 2014, 2:07pm

I created an Elasticsearch plugin that lets you customize the default
word-boundary properties of Lucene's StandardTokenizer:

GitHub - bbguitar77/elasticsearch-analysis-standardext

As mentioned before, there are other tokenizers / filters that can be used,
but each one has its own drawbacks, and you forgo the benefits of the
StandardTokenizer (e.g. dealing with invisible characters that aren't
considered whitespace, being able to define conditional word-break rules
for punctuation symbols based on how they appear in the text).

Hope this helps

Bryan
