Understanding fielddata size

We have a field containing path information like: a>b>c>d
The paths are usually 4 segments long.

The field is mapped both as a keyword field and as analyzed text using the path-tokenizer.
Fielddata is enabled on the analyzed text field because we run aggregations over the path segments, roughly like the query sketched below.
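For illustration, the aggregation looks something like this (a minimal sketch; my-index and the aggregation name are placeholders, not our real names):

GET my-index/_search
{
	"size": 0,
	"aggs": {
		"path_segments": {
			"terms": {
				"field": "path"
			}
		}
	}
}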

I would have expected the fielddata size of the text field to be at most 4 times that of the keyword field, since each value should produce 4 path tokens like [a>b>c>d, a>b>c, a>b, a].
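That expectation can be checked with the _analyze API (a sketch; my-index stands in for the actual index name):

GET my-index/_analyze
{
	"analyzer": "path-analyzer",
	"text": "a>b>c>d"
}

This returns the 4 tokens a, a>b, a>b>c and a>b>c>d, as expected.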

However, when I list fielddata usage with the _cat API, I see about 100 KB for the keyword field (path.untouched) and about 400 MB for the text field (path).
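The numbers come from a request along these lines (field names as in the mapping below):

GET _cat/fielddata?v&fields=path,path.untouched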

So the text field's fielddata is roughly 4,000 times larger than the keyword field's, not 4 times.

Can anyone explain?

Here is the relevant excerpt from the mapping. The Elasticsearch version is 6.4.

"analysis": {
	"analyzer": {
		"path-analyzer": {
			"type": "custom",
			"tokenizer": "path-tokenizer",
			"filter": "lowercase"
		},
	"lowercase-analyzer": {
		"type": "custom",
		"tokenizer": "keyword",
		"filter": "lowercase"
	}
	},
	"tokenizer": {
		"path-tokenizer": {
			"type": "path_hierarchy",
			"delimiter": ">"
		}
	}
}
...
"properties": {
	"path": {
		"type": "text",
		"analyzer": "path-analyzer",
		"search_analyzer": "lowercase-analyzer",
		"fielddata": true,
		"fields": {
			"search": {
				"type": "text",
				"analyzer": "standard"
			},
			"untouched": {
				"type": "keyword"
			}
		}
	},
	...

Is there nobody who can help me with this excessive fielddata memory usage?

In our production environment, fielddata for that single field occupies about 2 GB.
That really is an issue.
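For anyone who wants to reproduce the measurement, per-node fielddata usage can be read like this (a sketch; adjust the field list as needed):

GET _nodes/stats/indices/fielddata?fields=path,path.untouched

The same breakdown is also available per index via GET my-index/_stats/fielddata?fields=path,path.untouched.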
