Altering the standard analyzer


(lasseschou) #1

I want to create custom analyzers for my solution. One of them needs to be very close to the standard analyzer, but handle dots (.) differently.

I've read the documentation on the built-in analyzers and how to create custom analyzers. But what I'm missing is the following:

  • An exact description of the Standard analyzer, including character filters, tokenizers and token filters.
  • An exact description of the Simple analyzer, including character filters, tokenizers and token filters.

Ideally, I'd like to see a complete PUT /my-index {"settings", ...} call (like in the custom analyzers doc).

Thanks!


(Robbie Ogburn) #2

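Standard Analyzer - built using the Standard Tokenizer with the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter. It uses no character filters. Note that its stopword list is empty by default, whereas a bare "stop" filter removes English stopwords, so drop "stop" from the filter list below if you need an exact match. I could recreate it like so: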

{
    "type":      "custom",
    "tokenizer": "standard",
    "filter":  [ "standard", "lowercase", "stop" ]
}

Simple Analyzer - built using a Lower Case Tokenizer, with no character filters or token filters. I could recreate it like so:

{
    "type":      "custom",
    "tokenizer": "lowercase"
}

Putting it into a complete example:

PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop"
          ]
        },
        "custom_simple": {
          "type": "custom",
          "tokenizer": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "field1": {
          "type": "string",
          "analyzer": "custom_standard"
        },
        "field2": {
          "type": "string",
          "analyzer": "custom_simple"
        }
      }
    }
  }
}
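
To check what each analyzer actually emits, the index's _analyze endpoint can be called once the index exists. This is a minimal sketch, assuming the index and analyzer names from the example above and the query-parameter form of _analyze used by the releases that still have the "string" field type; the sample text is arbitrary:

```json
GET /my-index/_analyze?analyzer=custom_standard&text=The+Quick+Brown+Foxes

GET /my-index/_analyze?analyzer=custom_simple&text=The+Quick+Brown+Foxes
```

Comparing the returned tokens with those from analyzer=standard and analyzer=simple on the same text is a quick way to confirm the custom definitions behave like the built-in ones.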

(lasseschou) #3

Thanks so much, very helpful.

I want to create a clone of the standard analyzer, the only difference
being that it splits words at any '.' inside them. Example:

www.test.com

Should be tokenized into www, test and com.

Can you help me create the mapping code for that? Thanks!

Lasse
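
One possible approach, sketched here as an assumption rather than an answer given in this thread: the standard tokenizer keeps a dot between two letters inside the token, so a pattern_replace character filter can turn every dot into a space before tokenization. The names dots_to_spaces and standard_split_dots are hypothetical, and "stop" is left out of the filter list because the built-in standard analyzer removes no stopwords by default.

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dots_to_spaces": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      },
      "analyzer": {
        "standard_split_dots": {
          "type": "custom",
          "char_filter": [ "dots_to_spaces" ],
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase" ]
        }
      }
    }
  }
}
```

The aim is for www.test.com to come out as www, test and com, which can be checked with GET /my-index/_analyze?analyzer=standard_split_dots&text=www.test.com. The trade-off is that every dot is replaced, so a decimal such as 3.14 would also be split into 3 and 14.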

On Tuesday, 22 September 2015, Robbie Ogburn <noreply@discuss.elastic.co> wrote:

