We have a field containing path information such as: a>b>c>d
The paths usually have a length of 4 segments.
The field is mapped both as a keyword field and as analyzed text using a path_hierarchy tokenizer (path-tokenizer).
Fielddata is enabled for the analyzed text field as we are running aggregations over the path segments.
I would have expected the size of fielddata for the text field to be at most 4 times that of the keyword field, as each value should result in 4 path entries like [a>b>c>d, a>b>c, a>b, a]
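To make that expectation concrete, here is a minimal Python sketch of the tokens I assume the analyzer emits for one value. It only emulates the prefix expansion plus the lowercase filter; it is not the actual Lucene tokenizer, and the function name `path_hierarchy_tokens` is made up for illustration.

```python
from typing import List


def path_hierarchy_tokens(path: str, delimiter: str = ">") -> List[str]:
    """Emulate the prefix tokens assumed for a path_hierarchy tokenizer
    followed by a lowercase filter (illustrative only)."""
    segments = path.split(delimiter)
    # One token per prefix: first segment, first two segments, and so on.
    return [
        delimiter.join(segments[:i]).lower()
        for i in range(1, len(segments) + 1)
    ]


tokens = path_hierarchy_tokens("a>b>c>d")
print(tokens)       # ['a', 'a>b', 'a>b>c', 'a>b>c>d']
print(len(tokens))  # 4
```

Under this assumption a 4-segment path yields 4 tokens, which is where the "at most 4 times" estimate comes from.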
However, when using the cat API to list fielddata, I see about 100 kb for the keyword field (path.untouched), and 400 mb for the text field (path).
So the fielddata for the text field is about 4000 times larger than for the keyword field, not 4 times.
Can anyone explain the difference?
Here is the relevant excerpt from the mapping. The Elasticsearch version is 6.4.
"analysis": {
  "analyzer": {
    "path-analyzer": {
      "type": "custom",
      "tokenizer": "path-tokenizer",
      "filter": "lowercase"
    },
    "lowercase-analyzer": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": "lowercase"
    }
  },
  "tokenizer": {
    "path-tokenizer": {
      "type": "path_hierarchy",
      "delimiter": ">"
    }
  }
}
...
"properties": {
  "path": {
    "type": "text",
    "analyzer": "path-analyzer",
    "search_analyzer": "lowercase-analyzer",
    "fielddata": true,
    "fields": {
      "search": {
        "type": "text",
        "analyzer": "standard"
      },
      "untouched": {
        "type": "keyword"
      }
    }
  },...