Illustration of DocValues, Fielddata and Inverted Index


I would like to better understand Elasticsearch's internal data structures, specifically 1) the inverted index, 2) fielddata and 3) DocValues.
Assume we have the following use case:

Sample data 1:
We’re given a "features" field that contains tilde-separated letters, each letter being one feature.

    { "docID": 1, "features": "A~B~C" }
    { "docID": 2, "features": "A~C" }
    { "docID": 3, "features": "A~C" }

We want to be able to aggregate the features individually across an index:

    { "key": "A", "doc_count": 3 }
    { "key": "B", "doc_count": 1 }
    { "key": "C", "doc_count": 3 }

And aggregate them by the combination in which they occur together:

    { "key": "A~B~C", "doc_count": 1 }
    { "key": "A~C", "doc_count": 2 }

Mapping 1:

"mappings": {
  "properties": {
    "features": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      },
      "analyzer": "feature_analyzer",
      "fielddata": true
    }
  }
}

Here "feature_analyzer" is a simple analyzer that splits the text on tildes.

Could you kindly confirm or correct the following data structure hypotheses for the sample data above?


DocID Term
1 A~B~C
2 A~C
3 A~C

Inverted Index:

Term DocID
A 1, 2, 3
B 1
C 1, 2, 3
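
To make the hypothesis above concrete, here is a toy Python model of an inverted index built from Sample data 1. It is only a sketch: `analyze` is a hypothetical stand-in for "feature_analyzer" (assumed to split on tildes, as described), and a plain dict of sorted lists only mimics the logical shape, not Lucene's actual postings format.

```python
from collections import defaultdict

# Sample data 1: docID -> raw "features" value.
docs = {1: "A~B~C", 2: "A~C", 3: "A~C"}

def analyze(value):
    """Hypothetical stand-in for feature_analyzer: split on tildes."""
    return value.split("~")

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

inverted_index = build_inverted_index(docs)
print(inverted_index)
# {'A': [1, 2, 3], 'B': [1], 'C': [1, 2, 3]}
```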

Fielddata, two hypotheses. It is described as "inverting the inverted index", but what does that really produce?

Fielddata - 1:

DocID Terms
1, 2, 3 A
1 B
1, 2, 3 C

Fielddata - 2:

DocID Terms
1 A, B, C or [A, B, C]
2 A, C or [A, C]
3 A, C or [A, C]
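
The two hypotheses differ only in how the table is keyed. As a sketch under my own assumptions (a dict standing in for the in-memory structure, with the hypothetical helper `uninvert`), flipping the term-to-documents relationship of the inverted index would look like this. It yields the shape of hypothesis 2: one entry per document, listing that document's terms.

```python
from collections import defaultdict

# Inverted index from the hypothesis above: term -> docIDs.
inverted_index = {"A": [1, 2, 3], "B": [1], "C": [1, 2, 3]}

def uninvert(inverted_index):
    """Flip term -> docIDs into docID -> terms (a doc-to-terms lookup)."""
    doc_to_terms = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            doc_to_terms[doc_id].append(term)
    return dict(sorted(doc_to_terms.items()))

doc_to_terms = uninvert(inverted_index)
print(doc_to_terms)
# {1: ['A', 'B', 'C'], 2: ['A', 'C'], 3: ['A', 'C']}
```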

We wish to avoid using fielddata.
Now suppose that, instead of using a multi-field, we duplicate the "features" field: once stored as a tilde-separated keyword and once as an array of keywords.

Sample data 2:

    { "docID": 1, "features": "A~B~C", "features-array": ["A", "B", "C"] }
    { "docID": 2, "features": "A~C", "features-array": ["A", "C"] }
    { "docID": 3, "features": "A~C", "features-array": ["A", "C"] }

Mapping 2:

"mappings": {
  "properties": {
    "features": {
      "type": "keyword"
    },
    "features-array": {
      "type": "keyword"
    }
  }
}

What are the associated data structures in this case? Could you kindly confirm or correct the following, and illustrate the data structures for the "features-array" field?

  • For "features" field:


DocID Term
1 A~B~C
2 A~C
3 A~C

Inverted Index:

Term DocID
A~B~C 1
A~C 2, 3
  • For the "features-array" field I'm completely unsure.
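
One way to reason about it: the Elasticsearch documentation states that there is no dedicated array type and that any field can hold multiple values. Assuming each array element is therefore indexed as its own term, a toy model of the two keyword fields in Mapping 2 could look like the sketch below. The function names are hypothetical, the sorted and de-duplicated per-document values are my assumption about how keyword doc values behave, and the dicts only mimic the logical shape, not the real storage format.

```python
from collections import defaultdict

# Sample data 2: one keyword holding the joined string, one holding an array.
docs = {
    1: {"features": "A~B~C", "features-array": ["A", "B", "C"]},
    2: {"features": "A~C", "features-array": ["A", "C"]},
    3: {"features": "A~C", "features-array": ["A", "C"]},
}

def field_values(doc, field):
    """Return a field's values as a list; a scalar counts as one value."""
    value = doc[field]
    return value if isinstance(value, list) else [value]

def inverted_index(docs, field):
    """term -> sorted docIDs; keyword values are indexed whole, unanalyzed."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(field_values(docs[doc_id], field)):
            postings[term].append(doc_id)
    return dict(sorted(postings.items()))

def doc_values(docs, field):
    """docID -> sorted, de-duplicated values (a column-style lookup)."""
    return {d: sorted(set(field_values(docs[d], field))) for d in sorted(docs)}

print(inverted_index(docs, "features"))        # {'A~B~C': [1], 'A~C': [2, 3]}
print(inverted_index(docs, "features-array"))  # {'A': [1, 2, 3], 'B': [1], 'C': [1, 2, 3]}
print(doc_values(docs, "features-array"))      # {1: ['A', 'B', 'C'], 2: ['A', 'C'], 3: ['A', 'C']}
```

In this model the array field ends up with the same logical structures as the analyzed text field from Mapping 1, while the joined-string field keeps each combination as a single term.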


Given the latter mapping, why can’t the following aggregation identify unique arrays?

GET example-index/_search
{
  "size": 0,
  "aggs": {
    "features": {
      "terms": {
        "field": "features-array"
      }
    }
  }
}

Actual result:

    { "key": "A", "doc_count": 3 }
    { "key": "B", "doc_count": 1 }
    { "key": "C", "doc_count": 3 }

Desired result:

    { "key": ["A", "B", "C"], "doc_count": 1 }
    { "key": ["A", "C"], "doc_count": 2 }
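
A sketch of why the result might come out per element: if a terms aggregation creates one bucket per distinct value it sees in each document (my assumption, consistent with the actual result above), then an array contributes one bucket per element and the array as a whole never becomes a key. Under the same assumption, aggregating on the "features" field, which holds the joined string, would give the combined counts. `terms_agg` is a hypothetical helper, not an Elasticsearch API, and it sorts buckets by key only to keep the output deterministic.

```python
from collections import Counter

# Per-document values for the two fields of Mapping 2 (Sample data 2).
features_array = {1: ["A", "B", "C"], 2: ["A", "C"], 3: ["A", "C"]}
features = {1: ["A~B~C"], 2: ["A~C"], 3: ["A~C"]}

def terms_agg(column):
    """Toy terms aggregation: one bucket per distinct value, counted once per doc."""
    counts = Counter()
    for values in column.values():
        counts.update(set(values))
    return [{"key": k, "doc_count": n} for k, n in sorted(counts.items())]

print(terms_agg(features_array))
# [{'key': 'A', 'doc_count': 3}, {'key': 'B', 'doc_count': 1}, {'key': 'C', 'doc_count': 3}]
print(terms_agg(features))
# [{'key': 'A~B~C', 'doc_count': 1}, {'key': 'A~C', 'doc_count': 2}]
```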

Let me know if you need any clarifications regarding my questions.
Thanks!