Cannot configure Multiple analyzers on the same field

germax · July 19, 2019, 1:11am

I have a text field that is indexed four different ways: as raw text, as a keyword, as a token count, and as text without stopwords.

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "length": {
            "type": "token_count",
            "analyzer": "myAnalyzer",
            "store": true
          },
          "stop": {
            "type": "text",
            "analyzer": "myAnalyzer",
            "store": true
          }
        }
      }
    }
  }
}

However, when I retrieve these subfields, the stopwords are not removed and the token count corresponds to the original text. Example:

PUT test/_doc/1
{
  "my_text": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
}

GET test/_search
{
  "stored_fields": ["my_text.stop", "my_text.length"],
  "_source": ["my_text"],
  "query": {
    "match": {
      "my_text": {
        "query": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
      }
    }
  }
}
// OUTPUT
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 3.6679466,
  "_source" : {
      "my_text" : "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
  },
  "fields" : {
    "my_text.length" : [12],
    "my_text.stop" : ["Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"]
   }
}

Notice that stopwords such as "of" and "the" are not removed. Interestingly, when I use the Analyze API, it applies my analyzer correctly. Example:

GET test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
}
//outputs 7 tokens [Raymond, Ruth, Perelman, School, Medicine, University, Pennsylvania]

Does anyone know why my subfields with custom analyzers are not handled correctly during indexing?

dadoonet · July 19, 2019, 6:43am

They're not removed from what has been sent to Elasticsearch. What you get back is what you sent, whatever analysis happened later.

The only way to see the effect is to use the _analyze API.

germax · July 19, 2019, 1:37pm

Got it, thank you. Is it possible to return the token count with the Analyze API? I can't figure out where to specify type: "token_count". It does not seem to be part of the analyzer object, but of the mapping instead. However, the Analyze API takes no mappings.

dadoonet · July 19, 2019, 2:07pm

No, I don't think that's possible.
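
A possible workaround, offered as a sketch and not as something confirmed in this thread: the _analyze response contains a "tokens" array, so the count can be taken on the client side. The sample response below is hand-written for illustration. Its lowercased tokens and positions are an assumption about what myAnalyzer would return for this sentence, and real responses also carry start_offset, end_offset and type. count_tokens and count_positions are hypothetical helper names.

```python
import json

# Hand-written sample shaped like a `GET test/_analyze` response for "myAnalyzer".
# Token texts and positions are assumptions for illustration, not captured output.
SAMPLE_RESPONSE = json.loads("""
{
  "tokens": [
    {"token": "raymond",      "position": 0},
    {"token": "ruth",         "position": 2},
    {"token": "perelman",     "position": 3},
    {"token": "school",       "position": 4},
    {"token": "medicine",     "position": 6},
    {"token": "university",   "position": 9},
    {"token": "pennsylvania", "position": 11}
  ]
}
""")


def count_tokens(response):
    """Number of tokens the analyzer emitted (stopwords already removed)."""
    return len(response.get("tokens", []))


def count_positions(response):
    """Positions spanned up to the last token, including gaps left by stopwords."""
    tokens = response.get("tokens", [])
    if not tokens:
        return 0
    return max(t["position"] for t in tokens) + 1


print(count_tokens(SAMPLE_RESPONSE))     # 7
print(count_positions(SAMPLE_RESPONSE))  # 12
```

Under these assumptions the second number matches the 12 stored in my_text.length, which is consistent with token_count counting position increments by default. The token_count mapping has an enable_position_increments option; setting it to false should make the field skip removed stopwords, but check that against the documentation for your version.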

