Best practices to match subwords in foreign languages

Arny_2 · March 21, 2013, 1:12pm

Hi,

I'm having difficulty matching words that occur inside larger words.
E.g. Elasticsearch: if I search for "search", it should match
"Elasticsearch".
In German we have a lot of such words, like: Seidenchiffonbluse.
Now I want to match all words containing "bluse".

I have read a lot of examples about partial word matching using ngrams,
but to me this does not seem like the right way to go.
I don't want to match "blu", "blus" or anything like that.
The best way would be to provide a real dictionary of words and let
Elasticsearch split compounds into words/tokens.
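
The dictionary-based splitting described above can be sketched in plain Python. This only illustrates the idea and is not how Elasticsearch implements it; the function name `decompound`, the `min_subword_size` parameter and the three-word dictionary are all made up for the example.

```python
def decompound(token, dictionary, min_subword_size=2):
    """Return the lowercased token plus every dictionary word found inside it."""
    lowered = token.lower()
    subwords = []
    # Try every substring of at least min_subword_size characters and keep
    # only those that are real words according to the dictionary.
    for start in range(len(lowered)):
        for end in range(start + min_subword_size, len(lowered) + 1):
            candidate = lowered[start:end]
            if candidate in dictionary:
                subwords.append(candidate)
    return [lowered] + subwords


# Hypothetical mini dictionary; a real one would be a full word list.
GERMAN_WORDS = {"seiden", "chiffon", "bluse"}

tokens = decompound("Seidenchiffonbluse", GERMAN_WORDS)
print(tokens)
```

Because only dictionary words are emitted, a search for "bluse" would hit the compound, while fragments such as "blu" or "blus" never become tokens.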

Are there any predefined language settings or dictionaries inside ES?
We store many language-dependent texts inside one document, which looks like
this:
document : { EN : { title : "english title" }, DE : { title : "german title" }, ... }

Would appreciate any help.

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Clinton_Gormley · March 21, 2013, 3:53pm

I have read a lot of examples about partial word matching using
ngrams, but to me this does not seem like the right way to go.
I don't want to match "blu", "blus" or anything like that.
The best way would be to provide a real dictionary of words and let
Elasticsearch split compounds into words/tokens.

You're looking for the compound word token filter:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/compound-word-tokenfilter.html

clint
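
For illustration, index settings that use this filter might look roughly like the sketch below, written as a Python dict that can be serialized to the JSON body of an index-creation request. The filter type `dictionary_decompounder` and its `word_list` option are taken from the compound word token filter documentation; the names `german_compound_splitter` and `german_decompound` and the three-word list are placeholders made up here, and option names should be checked against the documentation for the Elasticsearch version in use.

```python
import json

# Assumption: the filter and analyzer names are invented for this sketch,
# and the word list is a tiny placeholder for a real German dictionary.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "german_compound_splitter": {
                    "type": "dictionary_decompounder",
                    "word_list": ["seiden", "chiffon", "bluse"],
                }
            },
            "analyzer": {
                "german_decompound": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so tokens match the lowercase word list.
                    "filter": ["lowercase", "german_compound_splitter"],
                }
            },
        }
    }
}

# JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```

The word list is something you supply yourself. Under these assumptions, the analyzer would be assigned to the German field (DE.title in the document layout above), so that "Seidenchiffonbluse" is also indexed under "bluse".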


Arny_2 · March 21, 2013, 5:05pm

This looks good.

Thanks

On Thursday, March 21, 2013 at 16:53:04 UTC+1, Clinton Gormley wrote:

I have read a lot of examples about partial word matching using
ngrams, but to me this does not seem like the right way to go.
I don't want to match "blu", "blus" or anything like that.
The best way would be to provide a real dictionary of words and let
Elasticsearch split compounds into words/tokens.

You're looking for the compound word token filter:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/compound-word-tokenfilter.html

clint

