Illustration of DocValues, Fielddata and Inverted Index

Hello,

I would like to better understand Elasticsearch's internal data structures, specifically 1) the inverted index, 2) fielddata and 3) DocValues.
Assume we have the following use case:

Sample data 1:
We're given a "features" field that contains tilde-separated letters (each letter being a feature).

{
  "docID": 1,
  "features": "A~B~C"
},
{
  "docID": 2,
  "features": "A~C"
},
{
  "docID": 3,
  "features": "A~C"
}

We want to be able to aggregate the features individually across an index:

{
  "key": "A",
  "doc_count": 3
},
{
  "key": "B",
  "doc_count": 1
},
{
  "key": "C",
  "doc_count": 3
}

And aggregate them as they occurred combinatorially (i.e. together):

{
  "key": "A~B~C",
  "doc_count": 1
},
{
  "key": "A~C",
  "doc_count": 2
}
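
To make the two target aggregations concrete, here is a minimal Python sketch that computes them directly from sample data 1. It says nothing about how Elasticsearch computes them internally; the names `docs`, `per_feature` and `per_combination` are just mine for illustration.

```python
from collections import Counter

# Sample data 1: docID -> raw "features" string.
docs = {1: "A~B~C", 2: "A~C", 3: "A~C"}

# Individual features: count each feature once per document.
per_feature = Counter(
    feature
    for raw in docs.values()
    for feature in set(raw.split("~"))
)

# Combinations: count the whole tilde-separated string as a single key.
per_combination = Counter(docs.values())

print(dict(per_feature))
print(dict(per_combination))
```

The first counter corresponds to the per-feature buckets and the second to the combination buckets listed above.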

Mapping 1:

"mappings": {
  "properties": {
    "features": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      },
      "analyzer": "feature_analyzer",
      "fielddata": true
    }
  }
}

Here, feature_analyzer is a custom analyzer that simply splits the text on tildes.

QUESTION:
Could you kindly confirm or correct the following hypotheses about the data structures built for the sample data above?

DocValues:

DocID Term
1 A~B~C
2 A~C
3 A~C

Inverted Index:

Term DocID
A 1, 2, 3
B 1
C 1, 2, 3

Fielddata, two hypotheses. It is described as inverting the inverted index, but what does that really produce?

Fielddata - 1:

DocID Terms
1, 2, 3 A
1 B
1, 2, 3 C

Fielddata - 2:

DocID Terms
1 A, B, C or [A, B, C]
2 A, C or [A, C]
3 A, C or [A, C]
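
To spell out what I mean by the two hypotheses, here is a toy Python model. It is only a sketch of my mental picture, not a claim about how Lucene actually stores anything; `feature_analyzer` is a stand-in for the custom analyzer and the dictionary names are mine.

```python
from collections import defaultdict

def feature_analyzer(text):
    # Stand-in for the custom analyzer: split on tildes.
    return text.split("~")

docs = {1: "A~B~C", 2: "A~C", 3: "A~C"}

# Inverted index: term -> sorted list of docIDs containing it.
inverted_index = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(feature_analyzer(docs[doc_id]))):
        inverted_index[term].append(doc_id)

# Hypothesis 2: "uninverting" walks every term and regroups by document,
# giving docID -> list of terms.
uninverted = defaultdict(list)
for term in sorted(inverted_index):
    for doc_id in inverted_index[term]:
        uninverted[doc_id].append(term)

print(dict(inverted_index))  # {'A': [1, 2, 3], 'B': [1], 'C': [1, 2, 3]}
print(dict(uninverted))      # {1: ['A', 'B', 'C'], 2: ['A', 'C'], 3: ['A', 'C']}
```

In this model, hypothesis 1 would simply be `inverted_index` with its two columns swapped, while hypothesis 2 is `uninverted`.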

We wish to avoid using fielddata.
Now suppose that, instead of using a multi-field, we store the features twice: once as a single tilde-separated keyword ("features") and once as an array of keywords ("features-array").

Sample data 2:

{
  "docID": 1,
  "features": "A~B~C",
  "features-array": ["A", "B", "C"]
},
{
  "docID": 2,
  "features": "A~C",
  "features-array": ["A", "C"]
},
{
  "docID": 3,
  "features": "A~C",
  "features-array": ["A", "C"]
}

Mapping 2:

"mappings": {
  "properties": {
    "features": {
      "type": "keyword"
    },
    "features-array": {
      "type": "keyword"
    }
  }
}

QUESTION:
What are the associated data structures in this case? Could you kindly confirm or correct the following, and illustrate the data structures for the "features-array" field?

  • For "features" field:

DocValues:

DocID Term
1 A~B~C
2 A~C
3 A~C

Inverted Index:

Term DocID
A~B~C 1
A~C 2, 3

  • For "features-array" field: I'm completely unsure.
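
My tentative guess, to be confirmed or corrected, is that an array is treated as several values of the same field, so each element becomes its own term. The Python sketch below is a toy model of that guess only; the function `index_keyword_field` and the assumption that per-document values are kept sorted and de-duplicated are mine, not something I have verified.

```python
from collections import defaultdict

docs = {
    1: {"features": "A~B~C", "features-array": ["A", "B", "C"]},
    2: {"features": "A~C", "features-array": ["A", "C"]},
    3: {"features": "A~C", "features-array": ["A", "C"]},
}

def index_keyword_field(docs, field):
    """Toy model: return (inverted_index, doc_values) for one keyword field."""
    inverted_index = defaultdict(list)
    doc_values = {}
    for doc_id in sorted(docs):
        value = docs[doc_id][field]
        # Assumption: an array is several values of the same field.
        values = value if isinstance(value, list) else [value]
        # Assumption: per-document values are a sorted, de-duplicated set.
        doc_values[doc_id] = sorted(set(values))
        for term in doc_values[doc_id]:
            inverted_index[term].append(doc_id)
    return dict(inverted_index), doc_values

print(index_keyword_field(docs, "features"))
print(index_keyword_field(docs, "features-array"))
```

If this guess is right, "features-array" would end up with the same inverted index as the analyzed text field in mapping 1 (A, B and C as separate terms), and its DocValues would hold a list of terms per document rather than a single string.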

QUESTION:

Given the latter mapping, why can’t the following aggregation identify unique arrays?

GET example-index/_search
{
  "size": 0,
  "aggs": {
    "features": {
      "terms": {
        "field": "features-array"
      }
    }
  }
}

Actual result:

        {
          "key" : "A",
          "doc_count" : 3
        },
        {
          "key" : "B",
          "doc_count" : 1
        },
        {
          "key" : "C",
          "doc_count" : 3
        }

Desired result:

        {
          "key" : ["A", "B", "C"],
          "doc_count" : 1
        },
        {
          "key" : ["A", "C"],
          "doc_count" : 2
        }
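
My working guess at the cause is that a terms aggregation creates one bucket per distinct value and never sees the array as a unit. The Python sketch below models that guess; the `doc_values` layout and the `terms_agg` function are hypothetical illustrations, and the buckets are sorted by key here only to keep the output deterministic.

```python
from collections import Counter

# Assumed per-document values of each keyword field (toy model of doc values).
doc_values = {
    "features": {1: ["A~B~C"], 2: ["A~C"], 3: ["A~C"]},
    "features-array": {1: ["A", "B", "C"], 2: ["A", "C"], 3: ["A", "C"]},
}

def terms_agg(field):
    """Toy terms aggregation: one bucket per distinct value, counting documents."""
    counts = Counter()
    for values in doc_values[field].values():
        # A document contributes once to every distinct value it holds.
        counts.update(set(values))
    return [{"key": key, "doc_count": n} for key, n in sorted(counts.items())]

print(terms_agg("features-array"))
print(terms_agg("features"))
```

Under this model, aggregating on "features-array" can only ever return A, B and C, while aggregating on the single-valued "features" keyword groups documents by the whole combination, which is the grouping I am after (with a string key instead of an array).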

Let me know if you need any clarification regarding my questions.
Thanks!
