My team and I are running into a problem with our current analyzer. We analyze user-generated content, and the stemming process generates many false positives.
For instance, the company name "Servier" stems to "servi", which then matches the word "Service". To avoid that, we would like to use a dictionary-based lemmatizer, but I have not managed to find one.
Is there any French lemmatizer I can use (in production) with ES?
I don't know of any! If you are willing to fiddle with it, you can recreate the French analyzer using the code here and then add a stemmer_override filter to prevent the company name from being stemmed.
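As a rough sketch of what that could look like: the Python snippet below only builds the index settings as a plain dict and prints them as JSON, so you can paste the result into a create-index request. The filter chain follows the rebuilt French analyzer from the Elasticsearch language analyzer docs as I remember it, so please check it against your version. The names `build_french_settings`, `rebuilt_french` and `french_override` are made up for this example, and the single rule for "servier" is just an illustration.

```python
import json


def build_french_settings(override_rules):
    """Build index settings for a hand-rebuilt French analyzer.

    override_rules: list of stemmer_override rules, e.g. ["servier => servier"].
    The function name and the filter/analyzer names are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Strips elided articles such as l', d', qu'.
                    "french_elision": {
                        "type": "elision",
                        "articles_case": True,
                        "articles": [
                            "l", "m", "t", "qu", "n", "s", "j", "d", "c",
                            "jusqu", "quoiqu", "lorsqu", "puisqu",
                        ],
                    },
                    "french_stop": {"type": "stop", "stopwords": "_french_"},
                    # Maps a token to a fixed form and protects it from
                    # the stemmer that runs later in the chain.
                    "french_override": {
                        "type": "stemmer_override",
                        "rules": list(override_rules),
                    },
                    "french_stemmer": {
                        "type": "stemmer",
                        "language": "light_french",
                    },
                },
                "analyzer": {
                    "rebuilt_french": {
                        "tokenizer": "standard",
                        # The override must come before the stemmer, and
                        # after lowercase so the rule matches "servier".
                        "filter": [
                            "french_elision",
                            "lowercase",
                            "french_stop",
                            "french_override",
                            "french_stemmer",
                        ],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    settings = build_french_settings(["servier => servier"])
    print(json.dumps(settings, indent=2))
```

With this in place, "Servier" should stay "servier" instead of being reduced to "servi", while other words are still stemmed. For a long list of names, the filter also accepts a `rules_path` pointing to a file instead of inline `rules`.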
Thanks for your answer. Unfortunately, we would rather not use term exclusion, since we would have to exclude a lot of proper nouns. It is probably possible, but neither realistic nor maintainable.