Cannot configure multiple analyzers on the same field

I have a text field indexed in four different ways, using two analyzers: raw text, keyword, token count, and text without stopwords.

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "length": {
            "type": "token_count",
            "analyzer": "myAnalyzer",
            "store": true
          },
          "stop": {
            "type": "text",
            "analyzer": "myAnalyzer",
            "store": true
          }
        }
      }
    }
  }
}

However, when I retrieve these subfields, the stopwords are not removed and the token count corresponds to the original text. Example:

PUT test/_doc/1
{
  "my_text": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
}

GET test/_search
{
  "stored_fields": ["my_text.stop", "my_text.length"],
  "_source": ["my_text"],
  "query": {
    "match": {
      "my_text": {
        "query": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
      }
    }
  }
}
// OUTPUT
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 3.6679466,
  "_source" : {
      "my_text" : "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
  },
  "fields" : {
    "my_text.length" : [12],
    "my_text.stop" : ["Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"]
   }
}

Notice that stopwords such as "of" and "the" are not removed. Interestingly, when I use the Analyze API, it applies my analyzer correctly. Example:

GET test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
}
//outputs 7 tokens [Raymond, Ruth, Perelman, School, Medicine, University, Pennsylvania]

Does anyone know why my subfields with custom analyzers are not handled correctly during indexing?

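They're not removed from what has been sent to Elasticsearch. A stored field returns the original value: what you get back is what you sent, regardless of whatever analysis happened afterwards. The analyzed tokens only go into the index used for searching.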

The only way to see the effect is to use the _analyze API.
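To make the distinction concrete, here is a toy Python model. It is not Elasticsearch code: the `analyze` function and the four-word stopword subset are invented purely for illustration. It shows that the stored value is the untouched original, while the indexed tokens are what the analyzer produced. It also hints at why `my_text.length` came back as 12: as far as I know, `token_count` counts position increments by default, so the gaps left by removed stopwords are still counted unless `enable_position_increments` is set to `false` in the mapping.

```python
# Toy model only: this is NOT Elasticsearch code, just an illustration.
STOPWORDS = {"and", "of", "at", "the"}  # assumed small subset of the _english_ list


def analyze(text):
    """Return (position, token) pairs, leaving gaps where stopwords were dropped."""
    tokens = []
    for position, word in enumerate(text.lower().split()):
        if word not in STOPWORDS:
            tokens.append((position, word))
    return tokens


text = "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"

stored_value = text             # what "store": true keeps and hands back unchanged
indexed_tokens = analyze(text)  # what the search index is built from

emitted = len(indexed_tokens)          # tokens that survived the stop filter
with_gaps = indexed_tokens[-1][0] + 1  # positions, still counting removed stopwords

print(stored_value)
print([token for _, token in indexed_tokens])
print(emitted, with_gaps)
```

The two numbers printed at the end correspond to the two ways of counting: emitted tokens only, or positions including the stopword gaps.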


Got it, thank you. Is it possible to return the token count with the Analyze API? I can't figure out where to specify type: "token_count". It does not seem to be part of the analyzer object, but of the mapping instead. However, the Analyze API takes no mappings.

No. It's not possible I think.
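That said, the _analyze response lists every emitted token, so the count can be derived on the client side. Below is a minimal sketch in Python, assuming the response body has already been parsed into a dict. The `count_tokens` helper is hypothetical, and `sample_response` is a hand-written sample shaped like an _analyze response for the text above (real responses also carry `start_offset`, `end_offset` and `type` for each token).

```python
# Hand-written sample shaped like a `GET test/_analyze` response (trimmed;
# real responses also include start_offset, end_offset and type per token).
sample_response = {
    "tokens": [
        {"token": "raymond", "position": 0},
        {"token": "ruth", "position": 2},
        {"token": "perelman", "position": 3},
        {"token": "school", "position": 4},
        {"token": "medicine", "position": 6},
        {"token": "university", "position": 9},
        {"token": "pennsylvania", "position": 11},
    ]
}


def count_tokens(analyze_response, count_gaps=False):
    """Hypothetical helper: count tokens in a parsed _analyze response.

    count_gaps=False counts only the emitted tokens.
    count_gaps=True also counts positions vacated by removed stopwords.
    """
    tokens = analyze_response.get("tokens", [])
    if not tokens:
        return 0
    if count_gaps:
        return max(t["position"] for t in tokens) + 1
    return len(tokens)


print(count_tokens(sample_response))
print(count_tokens(sample_response, count_gaps=True))
```

Counting the gaps is only approximate here: stopwords removed at the very end of the text leave no later token to reveal them, so this sketch would miss those positions.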
