Recommendation for large synonym file


#1

Hi all,

I am looking into the synonym token filter. The Elasticsearch documentation recommends that, when you work with large synonym datasets, you point synonyms_path at a file rather than inserting the synonyms directly into the configuration.
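For readers comparing the two options, here is a minimal sketch of what each looks like as an index-settings body. The filter name `my_synonyms`, the analyzer name `my_analyzer`, and the path `analysis/synonyms.txt` are placeholders chosen for illustration; only `type`, `synonyms`, and `synonyms_path` are the synonym filter's own settings.

```python
import json


def inline_synonym_settings(synonyms):
    """Index settings with the synonym rules embedded directly."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # "my_synonyms" is a hypothetical filter name.
                    "my_synonyms": {
                        "type": "synonym",
                        "synonyms": list(synonyms),
                    }
                },
                "analyzer": {
                    # "my_analyzer" is a hypothetical analyzer name.
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        }
    }


def file_synonym_settings(path):
    """Index settings that only reference a synonym file by path."""
    body = inline_synonym_settings([])
    synonym_filter = body["settings"]["analysis"]["filter"]["my_synonyms"]
    # Swap the embedded rules for a reference to a file on each node.
    del synonym_filter["synonyms"]
    synonym_filter["synonyms_path"] = path
    return body


if __name__ == "__main__":
    rules = ["laptop, notebook", "tv, television"]
    print(json.dumps(inline_synonym_settings(rules), indent=2))
    print(json.dumps(file_synonym_settings("analysis/synonyms.txt"), indent=2))
```

Either body would be sent when creating the index. With the second form, the file itself has to be present on every node, which the settings body alone does not take care of.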

But the documentation does not say why. Is it just for maintainability, so that you do not have to scroll through thousands of synonyms to check your configuration/mapping?

And how does Elasticsearch handle this synonym file? Are the contents of the file loaded into memory after a configuration update?

Thanks in advance!

Regards,

Tim


(David Pilato) #2

Probably because the cluster state, which contains the index settings, would become too big?


(Frank) #3

I'm keen to know the answer to this as well. Provisioning synonyms through index settings (done through the REST API) is way easier than provisioning through files.

@dadoonet: are you sure that all synonyms get sent along with cluster state? It doesn't seem logical to me since synonyms are part of the index settings, not the cluster state. Sending along all filters with cluster state seems weird.


(David Pilato) #4

Index settings are part of the index metadata and index metadata is part of the cluster state.
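To get a feel for what that means in practice, here is a rough sketch that compares the serialized size of a synonym filter definition with inline rules against one that only references a file. It is an approximation under stated assumptions: the JSON byte count is used as a stand-in for how much the settings add to the index metadata, the rule strings are made up, and the names `my_synonyms` and `analysis/synonyms.txt` are placeholders.

```python
import json


def settings_size_bytes(filter_body):
    """Size in bytes of a settings body containing one filter definition."""
    # "my_synonyms" is a hypothetical filter name.
    body = {"settings": {"analysis": {"filter": {"my_synonyms": filter_body}}}}
    return len(json.dumps(body).encode("utf-8"))


def make_rules(count):
    """Generate made-up synonym rules, one string per rule."""
    return [f"term{i}, alias{i}" for i in range(count)]


def inline_size(count):
    """Size of the settings when the rules are embedded inline."""
    return settings_size_bytes({"type": "synonym", "synonyms": make_rules(count)})


def path_size():
    """Size of the settings when only a file path is stored."""
    return settings_size_bytes(
        {"type": "synonym", "synonyms_path": "analysis/synonyms.txt"}
    )


if __name__ == "__main__":
    for count in (10, 1_000, 100_000):
        print(f"{count} inline rules: {inline_size(count)} bytes")
    print(f"file reference: {path_size()} bytes")
```

The inline variant grows with every rule, while the file reference stays the same size no matter how many rules the file holds.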


(Frank) #5

Alright, thanks for that! Putting them in a file seems better then :slightly_smiling:


#6

Thanks for the clarification!


(David Pilato) #7

I'm unsure if it's better. TBH I'd really love for it to be loaded from a document stored in an Elasticsearch index rather than from the file system, because it's harder to maintain a file on the FS and to distribute a consistent copy to all nodes.

I opened this feature request. We'll see where it goes: https://github.com/elastic/elasticsearch/issues/16824


(Ivan Brusic) #8

Years ago I wrote a collection of token filters that read their values from a
database. They have been in production all this time. I always wanted to reboot
that project into a public release, but it was held up due to a change in the
way analyzers are created/stored in Elasticsearch. The change was pushed
into the 3.x (now 5.x) branch. Since 5.0 will be released soon (alpha at
least), I should revisit the project.

Ivan

