I have an analysis chain like this for some Spanish text:
standard asciifolding lowercase es_stop_filter es_stem_filter es_synonyms
With synonyms at the end, after all the other filters, I have to define my
synonyms in their stemmed, ASCII-folded, lowercase forms. So instead of
defining a synonym set like "vacuna, vacunación, inmunización", I have to
define it as "vacun, vacunacion, inmunizacion".
In the case of a very aggressive stemmer like Snowball for English, we
would have to define "intern, global" as a synonym mapping when we'd really
want to write "international, global".
This is a little counter-intuitive for the folks who define our synonyms.
They think in dictionary terms, not stemmed tokens, so they need access to
a "standard asciifolding lowercase es_stop_filter es_stem_filter" analysis
chain (everything but the synonym filter) just to see which tokens to
specify in the synonyms file.
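
The manual step described above could in principle be scripted: keep the
synonyms file in dictionary form and run every entry through the chain
(minus the synonym filter) before loading it. Below is a minimal sketch in
Python, not a definitive implementation. The names analyze_synonym_line,
toy_analyze and fold_ascii are hypothetical, and toy_analyze is only a
stand-in for the real chain: its "stemmer" just strips a trailing a/o and
is NOT the Spanish stemmer. In practice the analyze callable would more
likely wrap a call to an analyzer that matches the real chain.

```python
import unicodedata


def fold_ascii(token):
    # Rough stand-in for asciifolding: decompose accented characters
    # and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_analyze(term):
    # Hypothetical stand-in for "standard asciifolding lowercase
    # es_stop_filter es_stem_filter". Assumption: whitespace tokenizing,
    # no stopword removal, and a toy stemmer that strips one trailing
    # "a" or "o" from tokens longer than three characters.
    tokens = [fold_ascii(t).lower() for t in term.split()]
    return [t[:-1] if len(t) > 3 and t[-1] in "ao" else t for t in tokens]


def analyze_synonym_line(line, analyze):
    # Rewrite one line of a Solr-style synonyms file ("a, b, c" or
    # "a, b => c") so that every term is in its analyzed form.
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return stripped  # keep blank lines and comments as they are
    sides = []
    for side in stripped.split("=>"):
        terms = []
        for term in side.split(","):
            analyzed = " ".join(analyze(term.strip()))
            if analyzed:  # drop terms the chain removes entirely
                terms.append(analyzed)
        sides.append(", ".join(terms))
    return " => ".join(sides)


print(analyze_synonym_line("vacuna, vacunación, inmunización", toy_analyze))
```

With a step like this the synonym editors would write "vacuna, vacunación,
inmunización" and the generated file would carry the analyzed tokens. The
cost is an extra build step that has to be rerun whenever the chain changes.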
A blog post about Solr,
http://www.igate.com/iblog/index.php/stemming-and-synonyms-in-apache-solr/,
mentions that one could define a "custom tokenizer that returns the stemmed
form of words from the synonyms file" to get around this. Is it possible to
configure Elasticsearch this way?