I am looking into the synonym token filter. The Elasticsearch documentation recommends that, when you work with large synonym datasets, you set synonyms_path to point at a file rather than inserting the synonyms directly into the configuration.
But the documentation does not say why. Is it just for maintainability, so that you do not have to scroll through thousands of synonyms to check your configuration/mapping?
And how does Elasticsearch handle this synonym file? Are the contents of the file loaded into memory after a configuration update?
I'm keen to know an answer to this as well. Provisioning synonyms through index settings is way easier (done through REST API) than provisioning through files.
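To make the two options concrete, here is a minimal sketch in Python that only builds the two index-settings bodies being compared and does not talk to a cluster. The `synonym` filter type and its `synonyms` / `synonyms_path` settings come from the Elasticsearch docs; the names `my_synonyms`, `my_analyzer`, the example synonym rules, the file path `analysis/synonyms.txt` and the helper function itself are made up for illustration.

```python
import json


def synonym_filter_settings(synonyms=None, synonyms_path=None):
    """Build an index-settings body containing one synonym token filter.

    Exactly one of `synonyms` (inline list of rules) or `synonyms_path`
    (path to a synonym file on each node) must be given.
    """
    if (synonyms is None) == (synonyms_path is None):
        raise ValueError("pass exactly one of synonyms or synonyms_path")

    filter_def = {"type": "synonym"}
    if synonyms is not None:
        # Inline: the rules travel with the index settings themselves.
        filter_def["synonyms"] = list(synonyms)
    else:
        # File-based: the settings only carry the path; the file has to
        # exist on every node that holds a shard of the index.
        filter_def["synonyms_path"] = synonyms_path

    return {
        "settings": {
            "analysis": {
                "filter": {"my_synonyms": filter_def},  # hypothetical name
                "analyzer": {
                    "my_analyzer": {  # hypothetical name
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        }
    }


# Example rules and path are illustrative only.
inline_body = synonym_filter_settings(
    synonyms=["laptop, notebook", "tv, television"]
)
file_body = synonym_filter_settings(synonyms_path="analysis/synonyms.txt")

print(json.dumps(inline_body, indent=2))
print(json.dumps(file_body, indent=2))
```

Either body could be sent as the request body when creating an index. The visible difference is that the inline variant grows with every synonym rule, while the file variant stays the same size but needs the file distributed to the nodes separately.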
@dadoonet: are you sure that all synonyms get sent along with cluster state? It doesn't seem logical to me since synonyms are part of the index settings, not the cluster state. Sending along all filters with cluster state seems weird.
I'm unsure if it's better. TBH I'd really love for synonyms to be loaded from a document stored in an Elasticsearch index rather than from the file system, because it's harder to maintain a file on the FS and distribute it consistently to all nodes.
Years ago I wrote a collection of token filters that read their values from a
database. Been in production all this time. Always wanted to reboot that
project into a public release, but it was held up due to a change in the
way analyzers are created/stored in Elasticsearch. The change was pushed
into the 3.x (now 5.x) branch. Since 5.0 will be released soon (alpha at
least), I should revisit the project.