I have a document with the following description: "we have a lot of skin care products".
and when I query with the word "skincare" I get a score of 0.
Is there a tokenizer for this case,
one that treats words like skincare - skin care, facemask - face mask, everybody - every body the same?
This is the index for the description field. I'm using the English tokenizer.
If your need is always to find the split version of a single-token term, you can use a synonym filter at indexing time in order to index both the single-token and multi-token form of each term.
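For instance, a minimal sketch of such an index-time synonym setup in Elasticsearch (assuming that's the engine in use here; the analyzer and filter names are illustrative, and you would extend the synonyms list with your own terms):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "compound_synonyms": {
          "type": "synonym",
          "synonyms": [
            "skincare => skincare, skin care",
            "facemask => facemask, face mask"
          ]
        }
      },
      "analyzer": {
        "description_index": {
          "tokenizer": "standard",
          "filter": ["lowercase", "compound_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "description_index" }
    }
  }
}
```

With this in place, a document containing "skincare" is indexed under both forms, so a query for "skin care" (or vice versa) can match.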
Thank you for the answer,
But in your solution I need to know the words in advance.
In my project it's dynamic: I don't know in advance which tokens and documents I'll be searching on.
So I'm asking whether there is a way to recognize common English words that can be written both as multiple whitespace-separated words and as a single word, like the example skincare & skin care.
Maybe there are some solutions with NLP tools, but with basic filters you need to build your own list in advance.
There are various sources of common English compound words on the net: https://www.google.com/search?q=compound+english+words+list
I've used "spaceless shingles" before now to avoid the need for synonym lists.
I can think of many word pairs like "skin care" that mean the same thing when collapsed into one word, but I struggle to think of examples where joining two words together (likethis) changes their meaning as a result. This means you can generally index your shingles without a space and overcome your problem.
Note however that spaceless shingles are not without issues.
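A sketch of what "spaceless shingles" could look like as an Elasticsearch shingle filter (again assuming Elasticsearch; the names here are illustrative): setting `token_separator` to an empty string makes adjacent token pairs join into a single spaceless token, so "skin care" is indexed as "skin", "care", and "skincare".

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "spaceless_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "token_separator": "",
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "spaceless_shingles"]
        }
      }
    }
  }
}
```

For either written form to match either indexed form, the same analyzer would need to be applied at both index and search time; note also that shingling inflates index size and can produce accidental joins you didn't intend.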