What is the default index analyzer?


(Xudong You) #1

I thought the default analyzer was the "standard" analyzer, but judging by the following experiment, it seems not to be.

  1. Create an index with a customized "standard" analyzer that includes a pattern_capture filter to split words on "." or "_":
POST /myindex
{
      "settings" : {
        "analysis" : {
          "filter" : {
            "customsplit" : {
              "type" : "pattern_capture",
              "preserve_original" : 1,
              "patterns" : [
                "([^_.]+)"
              ]
            }
          },
          "analyzer" : {
            "standard" : {
              "tokenizer" : "standard",
              "filter" : [
                "lowercase",
                "customsplit"
              ]
            }
          }
        }
      },
      "mappings" : {
        "docs" : {
          "properties" : {
            "Url" : {
              "type" : "string"
            }
          }
        }
      }
    }
  2. Insert one document into myindex:
POST /myindex/docs/1
{
	"Url": "www.xyz.com"
}

According to the _analyze API, the "standard" analyzer defined in myindex DOES split the word on ".":

GET /myindex/_analyze?analyzer=standard&text=www.xyz.com

output:

{
      "tokens": [
        {
          "token": "www.xyz.com",
          "start_offset": 0,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "www",
          "start_offset": 0,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "xyz",
          "start_offset": 0,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "com",
          "start_offset": 0,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        }
      ]
    }

BUT the problem is that if I search for "xyz" in myindex, nothing is returned:

POST /myindex/_search
{
 "query": {
  "match": {
    "Url": "xyz"
  }
 }
}
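
To see which tokens the Url field itself produces, rather than the analyzer looked up by name, _analyze can also be given a field. A minimal sketch, assuming the myindex index and docs mapping created above:

```
GET /myindex/_analyze
{
  "field": "Url",
  "text": "www.xyz.com"
}
```

If the response contains only the single token www.xyz.com, the field is not using the customized filter chain.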

BUT if I explicitly set the analyzer to "standard" in the index mapping:

"mappings": {
  "docs": {
    "properties": {
      "Url": {
        "type": "string",
        "analyzer": "standard"
      }
    }
  }
}

Then searching for "xyz" returns the document.

So my question is: is "standard" really the default analyzer of an ES index? If not, how do I set the default analyzer?

Or, if standard is indeed the default analyzer, is there anything wrong with my testing steps above?


(David Pilato) #2

Please format your code using the </> icon, as explained in this guide. It will make your post more readable.

The standard analyzer is explained here: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-standard-analyzer.html

As per https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis.html#_specifying_an_index_time_analyzer, the default analyzer is the standard one.
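
The same page describes the lookup order at index time: if a field has no analyzer in its mapping, Elasticsearch looks for an analyzer named default in the index settings before falling back to the built-in standard analyzer. A minimal sketch of an index-wide default, assuming the customsplit filter from the first post (not tested against this exact version):

```
PUT /myindex
{
  "settings": {
    "analysis": {
      "filter": {
        "customsplit": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["([^_.]+)"]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "customsplit"]
        }
      }
    }
  }
}
```

With a default analyzer defined this way, fields such as Url should not need an explicit analyzer in the mapping.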


(Xudong You) #3

Thanks David.
Yes, per the ES reference, the "standard" analyzer should be the default analyzer.

But then, is there anything wrong with my testing? Does the default "standard" refer only to the built-in standard analyzer, and not to the customized "standard" defined in my settings?


(David Pilato) #4

That's interesting.

Indeed, you can't "overwrite" the standard analyzer that is built into Elasticsearch here.

The proper way to solve your issue for now is to do something like:

DELETE myindex
PUT myindex
{
  "settings": {
    "analysis": {
      "filter": {
        "customsplit": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([^_.]+)"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "customsplit"
          ]
        }
      }
    }
  },
  "mappings": {
    "docs": {
      "properties": {
        "Url": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT /myindex/docs/1
{
	"Url": "www.xyz.com"
}
GET /myindex/_analyze?analyzer=my_analyzer&text=www.xyz.com
POST /myindex/_search
{
 "query": {
  "match": {
    "Url": "xyz"
  }
 }
}

Maybe open an issue on GitHub and refer to this topic? I think we should either reject standard as a custom analyzer name or pick the right one when running _search. Here I think we are using the built-in one at search time instead of the one defined within your index.

Thanks for reporting!


(Xudong You) #5

Thanks David.
Yes, I can resolve this issue by explicitly setting the "standard" analyzer on the "Url" field.

This seems to me to be a bug in Elasticsearch. As you said, ES should either reject "standard" as a name in custom analyzer settings or pick the right one when running _search.

I will open an issue on GitHub if one has not been opened yet.

