Hello,
I would like to better understand Elasticsearch's internal data structures, specifically 1) the inverted index, 2) fielddata, and 3) DocValues.
Assume we have the following use case:
Sample data 1:
We’re given a "features" field that contains tilde-separated letters (each letter being a feature).
{
"docID": 1,
"features": "A~B~C"
},
{
"docID": 2,
"features": "A~C"
},
{
"docID": 3,
"features": "A~C"
}
We want to be able to aggregate the features individually across an index:
{
"key": "A",
"doc_count": 3
},
{
"key": "B",
"doc_count": 1
},
{
"key": "C",
"doc_count": 3
}
We also want to aggregate them by the combination in which they occur together:
{
"key": "A~B~C",
"doc_count": 1
},
{
"key": "A~C",
"doc_count": 2
}
Mapping 1:
"mappings": {
"properties": {
"features" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
},
"analyzer" : "feature_analyzer",
"fielddata" : true
}
}
}
Here, "feature_analyzer" is a custom analyzer that simply splits the value on tildes.
QUESTION:
Could you kindly confirm or correct the following hypotheses about the data structures built for the sample data above?
DocValues:
DocID | Term |
---|---|
1 | A~B~C |
2 | A~C |
3 | A~C |
Inverted Index:
Term | DocID |
---|---|
A | 1, 2, 3 |
B | 1 |
C | 1, 2, 3 |
Fielddata, two hypotheses (it is said to "invert the inverted index", but what does that really produce?):
Fielddata - 1:
DocID | Terms |
---|---|
1, 2, 3 | A |
1 | B |
1, 2, 3 | C |
Fielddata - 2:
DocID | Terms |
---|---|
1 | A, B, C or [A, B, C] |
2 | A, C or [A, C] |
3 | A, C or [A, C] |
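To make the two hypotheses concrete, here is a toy Python sketch of how I read "inverting the inverted index": walk every posting list and regroup the terms by document, which would give the shape in hypothesis 2. The names `analyze`, `inverted` and `uninverted` are made up for illustration, and plain dictionaries are only an assumption standing in for whatever Lucene really stores.

```python
from collections import defaultdict

# Sample data 1: docID -> raw "features" string.
docs = {1: "A~B~C", 2: "A~C", 3: "A~C"}


def analyze(text):
    """Hypothetical stand-in for feature_analyzer: split on tildes."""
    return text.split("~")


# Inverted index: term -> list of docIDs that contain the term.
inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(analyze(docs[doc_id]))):
        inverted[term].append(doc_id)

# "Un-inverting": read each posting list and regroup the terms by docID.
uninverted = defaultdict(list)
for term in sorted(inverted):
    for doc_id in inverted[term]:
        uninverted[doc_id].append(term)

print(dict(inverted))    # {'A': [1, 2, 3], 'B': [1], 'C': [1, 2, 3]}
print(dict(uninverted))  # {1: ['A', 'B', 'C'], 2: ['A', 'C'], 3: ['A', 'C']}
```

If this reading is right, hypothesis 1 would just be the inverted index with its columns swapped, so it would not add any per-document lookup.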
We wish to avoid using fielddata.
Now suppose that, instead of using a multi-field, we duplicate the "features" field: once stored as a single tilde-separated keyword and once as an array of keywords.
Take the following example:
Sample data 2:
{
"docID": 1,
"features": "A~B~C",
"features-array": ["A", "B", "C"]
},
{
"docID": 2,
"features": "A~C",
"features-array": ["A", "C"]
},
{
"docID": 3,
"features": "A~C",
"features-array": ["A", "C"]
}
Mapping 2:
"mappings": {
"properties": {
"features": {
"type" : "keyword"
},
"features-array": {
"type": "keyword"
}
}
}
QUESTION:
What are the associated data structures in this case? Could you kindly confirm or correct the following, and illustrate the data structures for the "features-array" field?
- For the "features" field:
DocValues:
DocID | Term |
---|---|
1 | A~B~C |
2 | A~C |
3 | A~C |
Inverted Index:
Term | DocID |
---|---|
A~B~C | 1 |
A~C | 2, 3 |
- For the "features-array" field: I'm completely unsure.
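My best guess for "features-array" is sketched below in Python. It rests on two assumptions that I would like confirmed: that an array is treated as a multi-valued field whose elements are each indexed as a separate term, and that the DocValues of a keyword field keep a sorted, de-duplicated set of values per document. The helper `to_doc_values` is hypothetical and only models that second assumption.

```python
from collections import defaultdict

# Sample data 2: docID -> "features-array" values.
docs = {1: ["A", "B", "C"], 2: ["A", "C"], 3: ["A", "C"]}


def to_doc_values(values):
    """Assumed per-document DocValues entry: sorted, de-duplicated values."""
    return sorted(set(values))


# DocValues guess: docID -> sorted set of values.
doc_values = {doc_id: to_doc_values(values) for doc_id, values in docs.items()}

# Inverted index guess: every array element becomes its own term.
inverted = defaultdict(list)
for doc_id in sorted(doc_values):
    for term in doc_values[doc_id]:
        inverted[term].append(doc_id)

print(doc_values)      # {1: ['A', 'B', 'C'], 2: ['A', 'C'], 3: ['A', 'C']}
print(dict(inverted))  # {'A': [1, 2, 3], 'B': [1], 'C': [1, 2, 3]}

# Under this assumption, element order and duplicates are not preserved.
print(to_doc_values(["C", "A", "A"]))  # ['A', 'C']
```

If that is correct, the inverted index for "features-array" would look the same as the one for the analyzed "features" text field in Mapping 1.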
QUESTION:
Given the latter mapping, why can’t the following aggregation identify unique arrays?
GET example-index/_search
{
"size": 0,
"aggs": {
"features": {
"terms": {
"field": "features-array"
}
}
}
}
Actual Result:
{
"key" : "A",
"doc_count" : 3
},
{
"key" : "B",
"doc_count" : 1
},
{
"key" : "C",
"doc_count" : 3
}
Desired Results:
{
"key" : ["A", "B", "C"],
"doc_count" : 1
},
{
"key" : ["A", "C"],
"doc_count" : 2
}
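For reference, this is how I currently picture the terms aggregation, as a toy Python model rather than the real implementation: it creates one bucket per distinct value and counts a document once for every value it holds. The function `terms_agg` and the dictionaries are hypothetical names used only for this sketch.

```python
from collections import Counter


def terms_agg(doc_values):
    """Toy terms aggregation: one bucket per distinct value,
    each document counted once per value it holds."""
    buckets = Counter()
    for values in doc_values.values():
        for value in set(values):
            buckets[value] += 1
    return dict(sorted(buckets.items()))


# Per-document values for the two fields of Mapping 2.
features_array = {1: ["A", "B", "C"], 2: ["A", "C"], 3: ["A", "C"]}
features = {1: ["A~B~C"], 2: ["A~C"], 3: ["A~C"]}

print(terms_agg(features_array))  # {'A': 3, 'B': 1, 'C': 3}
print(terms_agg(features))        # {'A~B~C': 1, 'A~C': 2}
```

In this model the combination buckets only appear for a field whose single value encodes the whole combination, such as the tilde-joined "features" keyword, which matches the results I observe.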
Let me know if you need any clarifications regarding my questions.
Thanks!