Illustration of DocValues, Fielddata and Inverted Index

galambo · October 26, 2021, 1:18pm

Hello,

I would like to better understand Elasticsearch's internal data structures, specifically 1) the inverted index, 2) fielddata, and 3) DocValues.
Assume we have the following use case:

Sample data 1:
We’re given a "features" field that contains tilde-separated letters (each letter being a feature).

{
  "docID": 1,
  "features": "A~B~C"
},
{
  "docID": 2,
  "features": "A~C"
},
{
  "docID": 3,
  "features": "A~C"
}

We want to be able to aggregate the features individually across an index:

{
  "key": "A",
  "doc_count": 3
},
{
  "key": "B",
  "doc_count": 1
},
{
  "key": "C",
  "doc_count": 3
}

And aggregate them as they occurred combinatorially (i.e. together):

{
  "key": "A~B~C",
  "doc_count": 1
},
{
  "key": "A~C",
  "doc_count": 2
}

Mapping 1:

"mappings": {
  "properties": {
    "features": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      },
      "analyzer": "feature_analyzer",
      "fielddata": true
    }
  }
}

Here, feature_analyzer simply splits the text on tildes.

QUESTION:
Could you kindly confirm or correct the following hypotheses about the data structures built for the sample data above?

DocValues:

DocID	Term
1	A~B~C
2	A~C
3	A~C

Inverted Index:

Term	DocID
A	1, 2, 3
B	1
C	1, 2, 3

Fielddata, two hypotheses (it is described as "inverting the inverted index", but what does that really do?):

Fielddata - 1:

DocID	Terms
1, 2, 3	A
1	B
1, 2, 3	C

Fielddata - 2:

DocID	Terms
1	A, B, C or [A, B, C]
2	A, C or [A, C]
3	A, C or [A, C]
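
To make the two hypotheses concrete, here is a minimal Python sketch of how I picture it. It is a toy model, not how Lucene actually lays things out: `analyze`, `inverted` and `uninverted` are names I made up for illustration. It builds the term-to-documents table from the sample data and then mechanically transposes it. The transposed result has the shape of "Fielddata - 2" (one entry per document, listing its terms); whether that is what fielddata really holds is what I would like confirmed.

```python
from collections import defaultdict

# Sample data 1, keyed by docID.
docs = {1: "A~B~C", 2: "A~C", 3: "A~C"}


def analyze(text):
    """Hypothetical stand-in for feature_analyzer: split on tildes."""
    return text.split("~")


# Inverted index: term -> ascending list of docIDs containing it.
inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in analyze(docs[doc_id]):
        if doc_id not in inverted[term]:
            inverted[term].append(doc_id)

# "Inverting the inverted index": walk every term's posting list and
# regroup by document, giving docID -> list of terms.
uninverted = defaultdict(list)
for term in sorted(inverted):
    for doc_id in inverted[term]:
        uninverted[doc_id].append(term)

print(dict(inverted))
print(dict(uninverted))
```

Note that the transposed table only contains the analyzed terms, so the original "A~B~C" string cannot be recovered from it.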

We wish to avoid using fielddata.
Now let’s suppose that, instead of using a multi-field, we duplicate the "features" field: once stored as a single tilde-separated keyword and once as an array of keywords.

Let's take the following example:

Sample data 2:

{
  "docID": 1,
  "features": "A~B~C",
  "features-array": ["A", "B", "C"]
},
{
  "docID": 2,
  "features": "A~C",
  "features-array": ["A", "C"]
},
{
  "docID": 3,
  "features": "A~C",
  "features-array": ["A", "C"]
}

Mapping 2:

"mappings": {
  "properties": {
    "features": {
      "type": "keyword"
    },
    "features-array": {
      "type": "keyword"
    }
  }
}

QUESTION:
What are the associated data structures in this case? Could you kindly confirm or correct the following, and illustrate the data structures for the "features-array" field?

For "features" field:

DocValues:

DocID	Term
1	A~B~C
2	A~C
3	A~C

Inverted Index:

Term	DocID
A~B~C	1
A~C	2, 3

For the "features-array" field, I'm completely unsure.

QUESTION:

Given the latter mapping, why can’t the following aggregation identify unique arrays?

GET example-index/_search
{
  "size": 0,
  "aggs": {
    "features": {
      "terms": {
        "field": "features-array"
      }
    }
  }
}

Actual result:

        {
          "key" : "A",
          "doc_count" : 3
        },
        {
          "key" : "B",
          "doc_count" : 1
        },
        {
          "key" : "C",
          "doc_count" : 3
        }

Desired Results:

        {
          "key" : ["A", "B", "C"],
          "doc_count" : 1
        },
        {
          "key" : ["A", "C"],
          "doc_count" : 2
        }
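
For reference, this is the mental model I currently have of the behaviour, written as a small Python sketch. It assumes (my assumption, not something I have confirmed) that an array field is handled as a set of independent values per document, with no record of the array as a unit. `terms_agg` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter

# Per-document values for each field in Sample data 2.
features_array = {1: ["A", "B", "C"], 2: ["A", "C"], 3: ["A", "C"]}
features = {1: ["A~B~C"], 2: ["A~C"], 3: ["A~C"]}


def terms_agg(column):
    """Toy terms aggregation: one bucket per distinct value,
    counting each document at most once per value."""
    buckets = Counter()
    for values in column.values():
        buckets.update(set(values))
    # Order by doc_count descending, then by key for a stable result.
    ordered = sorted(buckets.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


print(terms_agg(features_array))  # buckets per individual feature
print(terms_agg(features))        # buckets per combination
```

Under this model the array itself never appears as a key, which would match the actual result above, while the single-valued "features" keyword field yields the combination buckets.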

Let me know if you need any clarifications regarding my questions.
Thanks !

system · November 23, 2021, 1:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
