I have a text field that is indexed in four different ways: as raw text, as a keyword, as a token count, and as text with stopwords removed.
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "length": {
            "type": "token_count",
            "analyzer": "myAnalyzer",
            "store": true
          },
          "stop": {
            "type": "text",
            "analyzer": "myAnalyzer",
            "store": true
          }
        }
      }
    }
  }
}
However, when I retrieve these subfields, the stopwords are not removed and the token count corresponds to the original text. Example:
PUT test/_doc/1
{
  "my_text": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
}
GET test/_search
{
  "stored_fields": ["my_text.stop", "my_text.length"],
  "_source": ["my_text"],
  "query": {
    "match": {
      "my_text": {
        "query": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
      }
    }
  }
}
// OUTPUT
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 3.6679466,
  "_source" : {
    "my_text" : "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
  },
  "fields" : {
    "my_text.length" : [12],
    "my_text.stop" : ["Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"]
  }
}
Notice that stopwords such as "of" and "the" are not removed. Interestingly, when I use the Analyze API, my analyzer is applied correctly. Example:
GET test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"
}
//outputs 7 tokens [Raymond, Ruth, Perelman, School, Medicine, University, Pennsylvania]
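To make the numbers I expect explicit, here is a rough Python sketch of the two counts. It is not Elasticsearch code: it assumes plain whitespace tokenization with lowercasing as a stand-in for the standard analyzer, the stop word list is copied by hand from what I understand "_english_" to contain, and the names tokenize and remove_stop_words are made up for this illustration.

```python
# Stop words that "_english_" is assumed to refer to (hand-copied; treat this
# as an assumption rather than the authoritative list).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

TEXT = "Raymond and Ruth Perelman School of Medicine at the University of Pennsylvania"


def tokenize(text):
    """Hypothetical stand-in for the standard analyzer: split on whitespace, lowercase."""
    return [word.lower() for word in text.split()]


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


all_tokens = tokenize(TEXT)
kept_tokens = remove_stop_words(all_tokens)

print(len(all_tokens))   # token count of the original text
print(len(kept_tokens))  # token count after stop word removal
print(kept_tokens)
```

This gives 12 tokens for the original text and 7 once the stopwords are dropped, so I expected my_text.length to be 7 rather than 12.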
Does anyone know why my subfields with custom analyzers are not handled correctly during indexing?