The reason I ask is that I'm working on a project to index documents which may be in a variety of languages.
Choosing the default analysers is easy enough. But I recently decided that I want to tweak the tokeniser slightly, so that it is a pattern type with the pattern "[\W\.]+", i.e. so that dot (".") becomes another token separator.
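In index settings, I imagine that would look something like this (an untested sketch; the tokeniser name is my own invention, and note that the backslashes have to be doubled inside a JSON string):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "dot_and_word_boundary": {
          "type": "pattern",
          "pattern": "[\\W\\.]+"
        }
      }
    }
  }
}
```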
My understanding now is that you can't inject a different tokeniser into an existing off-the-shelf analyser. Instead you have to build your own from the ground up, and there are pages in the manual to help you do this.
Configuring, say, 20 languages this way means sending quite a big JSON object to the "_settings" endpoint. That's not necessarily a big deal.
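For French alone, for example, I'd end up with something like the following, adapted from the manual's example of reimplementing the built-in French analyser, with only the tokeniser swapped for my pattern one (untested, and the analyser and tokeniser names are my own; I've left out the optional keyword-marker filter the manual shows):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "dot_and_word_boundary": {
          "type": "pattern",
          "pattern": "[\\W\\.]+"
        }
      },
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu"]
        },
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "french_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "my_dot_sensitive_french_analyser": {
          "tokenizer": "dot_and_word_boundary",
          "filter": ["french_elision", "lowercase", "french_stop", "french_stemmer"]
        }
      }
    }
  }
}
```

Multiply that by 20 languages and you can see why I'd rather define it once.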
But it just occurred to me: given that a "dot-and-word-boundary" tokeniser is something I want to be my default for each of these languages, is there any way of doing that directly? That is, whether my customised French analyser is called "french" (replacing the default French analyser) or "my_dot_sensitive_french_analyser", is there any way of putting it into my ES server on a persistent basis, so that it's then made available to multiple projects?