Question on stemming + synonyms and tokenizerFactory

I have an analysis chain like this for some Spanish text:
standard asciifolding lowercase es_stop_filter es_stem_filter es_synonyms

With synonyms at the end, after all the other filters, I have to define my
synonyms in their stemmed, ASCII-folded, lowercase forms. So instead of
defining a synonym set like "vacuna, vacunación, inmunización", I have to
define it as "vacun, vacunacion, inmunizacion".
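For concreteness, here is a rough sketch of what index settings for that
chain might look like, written as a Python dict that could be sent as the
"settings" body when creating the index. The filter names are the ones
above; their definitions (the "_spanish_" stop list, the "light_spanish"
stemmer, the analyzer name "es_text") are assumptions for illustration, not
my actual configuration.

```python
# Sketch of index settings for the chain described above.
# Filter names come from the post; their definitions are assumptions.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "es_stop_filter": {"type": "stop", "stopwords": "_spanish_"},
            # Assumed stemmer variant; the real one may differ.
            "es_stem_filter": {"type": "stemmer", "language": "light_spanish"},
            "es_synonyms": {
                "type": "synonym",
                # This filter runs last, so the entries have to be written
                # in their folded, lowercased, stemmed form.
                "synonyms": ["vacun, vacunacion, inmunizacion"],
            },
        },
        "analyzer": {
            # "es_text" is a hypothetical analyzer name.
            "es_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "asciifolding",
                    "lowercase",
                    "es_stop_filter",
                    "es_stem_filter",
                    "es_synonyms",
                ],
            }
        },
    }
}
```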

In the case of a very aggressive stemmer like Snowball for English, we
would have to define "intern, global" as a synonym mapping when we'd really
want to write "international, global".

This is counter-intuitive for the people who define our synonyms. They
think in dictionary terms, not stemmed tokens, and they need access to a
"standard asciifolding lowercase es_stop_filter es_stem_filter" analysis
chain (everything but the synonym filter) just to see which tokens to
specify in the synonyms file.

In this blog post
http://www.igate.com/iblog/index.php/stemming-and-synonyms-in-apache-solr/ about
Solr, the author mentions that one could define a "custom tokenizer that
returns the stemmed form of words from the synonyms file" to get around
this. Is it possible to configure Elasticsearch this way?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a7009182-9577-4580-872a-1b121be3457d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Once you have your mapping set up, create an application that itself
constructs the analyzer you need. Then feed it your real words and let it
generate the stemmed versions.

I don't think that ES can be told to do this, but it provides the classes
you need to do it yourself.
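A minimal sketch of such a preprocessing step, in Python rather than with
the Java classes, might look like the following. The analyzer is passed in
as a plain function; fold_and_lowercase is only a stand-in that strips
accents and lowercases (no stop words, no stemming), and both function
names are made up for this example. In a real setup the function would
have to run the same chain as the index, minus the synonym filter.

```python
import unicodedata
from typing import Callable, List

Analyzer = Callable[[str], List[str]]


def fold_and_lowercase(text: str) -> List[str]:
    """Stand-in analyzer: split on whitespace, strip accents, lowercase.

    It does NOT remove stop words or stem; swap in the real chain for that.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().split()


def rewrite_synonym_line(line: str, analyze: Analyzer) -> str:
    """Rewrite one comma-separated synonym set into its analyzed form."""
    entries: List[str] = []
    for entry in line.split(","):
        tokens = analyze(entry.strip())
        if not tokens:
            # The entry analyzed to nothing (for example, only stop words).
            continue
        analyzed = " ".join(tokens)
        if analyzed not in entries:
            # Skip entries that collapse to the same analyzed form.
            entries.append(analyzed)
    return ", ".join(entries)


print(rewrite_synonym_line("vacuna, vacunación, inmunización", fold_and_lowercase))
# prints: vacuna, vacunacion, inmunizacion
```

The people maintaining the synonyms keep editing the dictionary-form file,
and the analyzed file is regenerated from it whenever it changes.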

For my own synonym processing, I do a Very Bad Thing. I create a synonym
_type and then each document contains a list of words or phrases that are
synonyms of each other. For a synonym query, I first query my synonym type.
Then I OR the queries for each of the matching synonym words or phrases.

This is also much easier to maintain: I can update the synonyms on the fly
and do not need to reindex the data at all. Not at all.

But it requires additional code, and it works best using the Java API. And
some folks have indicated there are serious performance issues making this
a Bad Solution. But I have not seen any problems with performance.

Oh, and all my words and phrases can be fully spelled out; it's only when
they are used in the subsequent query that they get analyzed (tokenized,
stemmed, and whatever else).
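Roughly, the two steps look like the sketch below. It is written in Python
for brevity and is not my actual code: the synonym lookup is faked with an
in-memory list instead of a query against the synonym type, and the
function names and the "body" field are made up for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for the synonym documents; each inner list is one
# document's set of fully spelled-out words or phrases.
SYNONYM_SETS: List[List[str]] = [
    ["vacuna", "vacunación", "inmunización"],
    ["international", "global"],
]


def lookup_synonyms(term: str) -> List[str]:
    """Step 1: find the synonym set containing the term.

    Falls back to the term alone when no set matches.
    """
    for synonym_set in SYNONYM_SETS:
        if term in synonym_set:
            return list(synonym_set)
    return [term]


def build_synonym_query(field: str, term: str) -> Dict:
    """Step 2: OR together one match_phrase clause per synonym.

    The text of each clause is analyzed by the field's analyzer at query
    time, so the synonyms themselves stay in dictionary form.
    """
    clauses = [{"match_phrase": {field: s}} for s in lookup_synonyms(term)]
    return {"bool": {"should": clauses, "minimum_should_match": 1}}


query = build_synonym_query("body", "global")
```

The resulting dict is the "query" part of an ordinary search request.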

Brian
