I have a document with the following description: "we have a lot of skin care products".
and when I query with the word "skincare" I get a score of 0.
Is there a tokenizer for this case,
one that treats words like skincare - skin care, facemask - face mask, everybody - every body the same?
This is the index for the description field. I'm using the English tokenizer.
If your need is always to find the split version of a single-token term, you can use a synonym filter at indexing time in order to index both the single-token and multi-token form of each term.
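For instance, a minimal sketch of such an index-time synonym setup in Elasticsearch (assuming that's the engine in use here; the analyzer and filter names are illustrative, and you would extend the synonyms list with your own terms):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "compound_synonyms": {
          "type": "synonym",
          "synonyms": [
            "skincare => skincare, skin care",
            "facemask => facemask, face mask"
          ]
        }
      },
      "analyzer": {
        "description_index": {
          "tokenizer": "standard",
          "filter": ["lowercase", "compound_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "description_index" }
    }
  }
}
```

With this in place, a document containing "skincare" is indexed under both forms, so a query for "skin care" (or vice versa) can match.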
Thank you for the answer,
But in your solution I need to know the words in advance.
In my project it's dynamic: I don't know in advance which tokens and documents I'll be searching on.
So I'm asking whether there is a way to recognize common English words that can be written both as multiple whitespace-separated words and as a single word, like the example skincare & skin care.
Maybe there are some solutions with NLP tools, but with basic filters you need to build your own list in advance.
There are various sources of common English compound words on the net: https://www.google.com/search?q=compound+english+words+list
I've used "spaceless shingles" before now to avoid the need for synonym lists.
I can think of many word pairs like "skin care" that mean the same thing when collapsed into one word, but I struggle to think of examples where joining two words together (likethis) changes their meaning as a result. This means you can generally index your shingles without a space and overcome your problem.
Note however that spaceless shingles are not without issues.
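A sketch of what "spaceless shingles" could look like as an Elasticsearch shingle filter (again assuming Elasticsearch; the names here are illustrative): setting `token_separator` to an empty string makes adjacent token pairs join into a single spaceless token, so "skin care" is indexed as "skin", "care", and "skincare".

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "spaceless_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "token_separator": "",
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "spaceless_shingles"]
        }
      }
    }
  }
}
```

For either written form to match either indexed form, the same analyzer would need to be applied at both index and search time; note also that shingling inflates index size and can produce accidental joins you didn't intend.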