Tuning the default analyzers for indexing/searching

Hello,

So far the default analyzer settings (i.e. untouched) have worked for me in the general case, but they are starting to show their shortcomings.

I want to tune certain aspects, e.g. I want to split words on dots, which doesn't happen by default:

GET _analyze?text=foo.bar.baz

{
   "tokens": [
      {
         "token": "foo.bar.baz",
         "start_offset": 0,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

I noticed I get the same output when providing an explicit analyzer, e.g. &analyzer=default or &analyzer=standard.

Now, in general it's not a problem: the documentation about this is OK and I have experience with it.

Basically, the defaults are fine and I just want to change them "a bit". As such, I want to replicate the current settings of the default (standard?) analyzer exactly as they are and only adjust the parts I want to change.

I'm looking at the Standard Analyzer page of the Elasticsearch Guide [1.5] and I think this might be what I want, but I'm unable to derive the actual definition of the analyzer from it so that I could build my custom one.

Or is it possible to tune the default/standard analyzer? The docs above say e.g. for the "Standard Token Filter":

The standard token filter currently does nothing. It remains as a placeholder in case some filtering function needs to be added in a future version

Does "future" mean I can customize it?

I'm totally fine writing a complete custom analyzer as long as I somehow can verify that the index/search definitions match exactly what is active by default plus the things I want to add. This is important for me as I don't want to mess things up by changing the way things are analyzed in unexpected ways without me noticing it.

thanks for any pointers,

  • Markus

PS: I'm still using 1.5, just waiting for the big 5 release to upgrade :)

FTR, my current approach, given the docs, is this setting with which I'm testing now:

{
  "ThatIndex": {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "standard",
                "lowercase",
                "stop"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
  }
}

Your current approach works and can be ported to 5.0 with a few changes. In 1.5 you can specify different default analyzers for search and indexing. See docs for details. Starting with 2.3 this behavior is slightly different though - you no longer have default_index.
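For the dot splitting itself, one option is a pattern_replace char filter that turns dots into spaces before the standard tokenizer sees the text. Below is a rough sketch that only builds the settings body and prints it as JSON; I have not run it against 1.5. The name dots_to_spaces and the helper build_settings are made up for illustration, and the filter list is copied from your post. Note that this would also split decimals and version numbers such as 1.5.

```python
import json


def build_settings(token_filters=("standard", "lowercase", "stop")):
    """Build an index-creation body that redefines the `default` analyzer.

    "dots_to_spaces" is a hypothetical label, not a built-in name.
    The token filter list defaults to the one from the question.
    """
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    # Replace every literal dot with a space before tokenizing.
                    "dots_to_spaces": {
                        "type": "pattern_replace",
                        "pattern": "\\.",
                        "replacement": " ",
                    }
                },
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "char_filter": ["dots_to_spaces"],
                        "tokenizer": "standard",
                        "filter": list(token_filters),
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into an index-creation request.
    print(json.dumps(build_settings(), indent=2))
```

The printed JSON would be the body of the request that creates the index; whether a char filter or a token filter such as word_delimiter fits better depends on how much else you want to change.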

To compare the behavior of the old default analyzer and the new custom default analyzer defined for your index, you can use two versions of the analyze API. For the old (global) default, don't specify the index name:

GET _analyze?text=foo.bar.baz

For the new default analyzer that you specified for your index, use:

GET ThatIndex/_analyze?text=foo.bar.baz
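If you want to check more than a handful of texts, the two responses can be compared mechanically. Here is a minimal sketch that works on already-fetched _analyze responses, so it needs no client library. The helper names are made up, the first response is the one from your post, and the second is a hypothetical response invented to show a mismatch.

```python
def token_tuples(response):
    """Reduce an _analyze response to (token, start, end, position) tuples."""
    return [
        (t["token"], t["start_offset"], t["end_offset"], t["position"])
        for t in response.get("tokens", [])
    ]


def same_analysis(old_response, new_response):
    """True if both responses contain the same tokens at the same offsets and positions."""
    return token_tuples(old_response) == token_tuples(new_response)


# Response from the question (global default analyzer).
old = {
    "tokens": [
        {"token": "foo.bar.baz", "start_offset": 0, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 1}
    ]
}

# Hypothetical response from an index whose default analyzer splits on dots
# (made up for illustration, not real output).
new = {
    "tokens": [
        {"token": "foo", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 1},
        {"token": "bar", "start_offset": 4, "end_offset": 7,
         "type": "<ALPHANUM>", "position": 2},
        {"token": "baz", "start_offset": 8, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 3},
    ]
}

print(same_analysis(old, old))  # True
print(same_analysis(old, new))  # False
```

It is worth including common English words such as "the" or "and" in the sample texts. As far as I remember, the stop filter defaults to English stop words while the standard analyzer in 1.x defaults to an empty list, so that is one place where a custom copy could silently differ; please check this against the docs for your version.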