# Illustration of DocValues, Fielddata and Inverted Index

Hello,

I would like to better understand Elasticsearch's internal data structures, specifically 1) the inverted index, 2) fielddata, and 3) DocValues.
Assume we have the following use case:

Sample data 1:
We’re given a "features" field that contains tilde-separated letters (each letter being a feature).

```
{
  "docID": 1,
  "features": "A~B~C"
},
{
  "docID": 2,
  "features": "A~C"
},
{
  "docID": 3,
  "features": "A~C"
}
```

We want to be able to aggregate the features individually across an index:

```
{
  "key": "A",
  "doc_count": 3
},
{
  "key": "B",
  "doc_count": 1
},
{
  "key": "C",
  "doc_count": 3
}
```

And aggregate them as they occurred combinatorially (i.e. together):

```
{
  "key": "A~B~C",
  "doc_count": 1
},
{
  "key": "A~C",
  "doc_count": 2
}
```

Mapping 1:

```
"mappings": {
  "properties": {
    "features": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      },
      "analyzer": "feature_analyzer",
      "fielddata": true
    }
  }
}
```

Here, `feature_analyzer` simply splits the field value on tildes.
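
For concreteness, the analyzer could be defined roughly as follows. This is a minimal sketch, assuming a custom analyzer built on the `pattern` tokenizer; the tokenizer name `tilde_tokenizer` is a placeholder chosen for illustration, and these settings would sit alongside the mappings in the index-creation request.

```
# Hypothetical settings: "tilde_tokenizer" is an illustrative name
PUT example-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "tilde_tokenizer": {
          "type": "pattern",
          "pattern": "~"
        }
      },
      "analyzer": {
        "feature_analyzer": {
          "type": "custom",
          "tokenizer": "tilde_tokenizer"
        }
      }
    }
  }
}
```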

QUESTION:
Could you kindly confirm or correct the following hypotheses about the data structures for the sample data above?

DocValues:

| DocID | Term |
|---|---|
| 1 | A~B~C |
| 2 | A~C |
| 3 | A~C |

Inverted Index:

| Term | DocID |
|---|---|
| A | 1, 2, 3 |
| B | 1 |
| C | 1, 2, 3 |
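
One way to check the inverted-index hypothesis would be to ask Elasticsearch which terms the analyzer emits for a sample value. A sketch using the `_analyze` API, assuming the index uses Mapping 1 and is named `example-index`, as in the search request further down:

```
# Shows the tokens that would be indexed for the "features" field
GET example-index/_analyze
{
  "field": "features",
  "text": "A~B~C"
}
```

If the analyzer splits on tildes as described, this should list the tokens A, B and C, which are the terms I would expect to find in the inverted index.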

Fielddata, two hypotheses. What does "inverting the inverted index" really do?

Fielddata - 1:

| DocID | Terms |
|---|---|
| 1, 2, 3 | A |
| 1 | B |
| 1, 2, 3 | C |

Fielddata - 2:

| DocID | Terms |
|---|---|
| 1 | A, B, C or [A, B, C] |
| 2 | A, C or [A, C] |
| 3 | A, C or [A, C] |

We wish to avoid using fielddata.
Now suppose that, instead of using a multi-field, we duplicate the "features" field, storing it once as a tilde-separated keyword and once as an array of keywords:

Sample data 2:

```
{
  "docID": 1,
  "features": "A~B~C",
  "features-array": ["A", "B", "C"]
},
{
  "docID": 2,
  "features": "A~C",
  "features-array": ["A", "C"]
},
{
  "docID": 3,
  "features": "A~C",
  "features-array": ["A", "C"]
}
```

Mapping 2:

```
"mappings": {
  "properties": {
    "features": {
      "type": "keyword"
    },
    "features-array": {
      "type": "keyword"
    }
  }
}
```

QUESTION:
What are the associated data structures in this case? Could you kindly confirm or correct the following, and illustrate the data structures for the "features-array" field?

• For "features" field:

DocValues:

| DocID | Term |
|---|---|
| 1 | A~B~C |
| 2 | A~C |
| 3 | A~C |

Inverted Index:

| Term | DocID |
|---|---|
| A~B~C | 1 |
| A~C | 2, 3 |

• For the "features-array" field, I'm completely unsure.
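
To see what is stored per document for "features-array", the doc values could be requested directly. A sketch, assuming Mapping 2 and the index name `example-index`. My understanding is that each array element is kept as a separate value for the document, so I would expect a DocID-to-values layout similar to the "Fielddata - 2" hypothesis, but I have not confirmed this:

```
# Returns the doc values of each hit instead of its _source
GET example-index/_search
{
  "_source": false,
  "docvalue_fields": ["features-array"]
}
```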

QUESTION:

Given the latter mapping, why can’t the following aggregation identify unique arrays?

```
GET example-index/_search
{
  "size": 0,
  "aggs": {
    "features": {
      "terms": {
        "field": "features-array"
      }
    }
  }
}
```

Actual result:

```
{
  "key": "A",
  "doc_count": 3
},
{
  "key": "B",
  "doc_count": 1
},
{
  "key": "C",
  "doc_count": 3
}
```

Desired Results:

```
{
  "key": ["A", "B", "C"],
  "doc_count": 1
},
{
  "key": ["A", "C"],
  "doc_count": 2
}
```
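
For comparison, pointing the same aggregation at the tilde-separated "features" keyword field from Mapping 2 should group documents by the whole string, since each document holds a single value there. This is a sketch of a possible workaround rather than a confirmed answer; the keys would be strings such as "A~B~C", not arrays:

```
# Same terms aggregation, but on the single-valued keyword field
GET example-index/_search
{
  "size": 0,
  "aggs": {
    "features": {
      "terms": {
        "field": "features"
      }
    }
  }
}
```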

Let me know if you need any clarification regarding my questions.
Thanks!
