Hello everyone,
I recently came across a curious bug (I'm using Elasticsearch 6.2.3) that occurs under very particular circumstances.
The problem is with sorting parent documents on a field from their nested documents. When the data is indexed with the Bulk API (which is my case), the resulting sort value is wrong and does not belong to any nested document of the hit. When the same data is indexed with the Index API, the problem doesn't occur.
I put together a minimal case that reproduces the bug:
Using the Bulk API (the bug occurs)
# Create index
PUT tree
{ "settings": {"number_of_shards": 1,"number_of_replicas": 0 } }
# Put mapping
PUT tree/family/_mapping
{"properties":{"name":{"type":"keyword"},"members":{"type":"nested","properties":{"firstname":{"type":"keyword"},"color":{"type":"keyword"},"levels":{"type":"nested","properties":{"strength":{"type":"integer"}}}}}}}
# Insert data (bulk index API)
POST _bulk
{ "index" : { "_index" : "tree", "_type" : "family", "_id" : "1" } }
{"name":"Doe","members":[{"firstName":"John","color":"brown","levels":{"strength":10}},{"firstName":"Serge","color":"brown","levels":{"strength":15}},{"firstName":"Marie","color":"brown","levels":{"strength":20}}]}
{ "index" : { "_index" : "tree", "_type" : "family", "_id" : "2" } }
{"name":"Simpson","members":[{"firstName":"Homer","color":"brown","levels":{"strength":30}},{"firstName":"Lisa","color":"brown","levels":{"strength":40}},{"firstName":"Marge","color":"brown","levels":{"strength":60}}]}
{ "index" : { "_index" : "tree", "_type" : "family", "_id" : "3" } }
{"name":"Simpson","members":[{"firstName":"Bart","color":"yellow","levels":{"strength":70}},{"firstName":"Snowball","color":"yellow","levels":{"strength":80}},{"firstName":"Maggie","color":"yellow","levels":{"strength":90}},{"firstName":"Gandpa","color":"brown","levels":{"strength":95}}]}
# Query
GET tree/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": {
              "value": "Simpson"
            }
          }
        },
        {
          "nested": {
            "path": "members",
            "query": {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "members.color": {
                        "value": "brown"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "members.levels.strength": {
        "order": "asc",
        "nested": {
          "path": "members",
          "filter": {
            "term": {
              "members.color": {
                "value": "brown"
              }
            }
          },
          "nested": {
            "path": "members.levels"
          }
        }
      }
    }
  ]
}
# Results
{
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "tree",
        "_type": "family",
        "_id": "2",
        "_score": null,
        "_source": {
          "name": "Simpson",
          "members": [
            {
              "firstName": "Homer",
              "color": "brown",
              "levels": {
                "strength": 30
              }
            },
            {
              "firstName": "Lisa",
              "color": "brown",
              "levels": {
                "strength": 40
              }
            },
            {
              "firstName": "Marge",
              "color": "brown",
              "levels": {
                "strength": 60
              }
            }
          ]
        },
        "sort": [
          10
        ]
      },
      ...
    ]
  }
}
As we can see here, the family with id=2 gets a sort value of 10, a value that doesn't exist in that document (it exists in the document with id=1, but that one is filtered out by the query).
Using the Index API (no bug in this case)
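To make the expected values explicit, here is a small Python sketch that computes by hand what the sort should return for the two matching documents. It assumes the ascending sort uses the default "min" mode and that only members with color "brown" pass the nested filter; the `families` structure and the `expected_sort_value` helper are illustrative names, not part of Elasticsearch.

```python
# Documents 2 and 3 from the bulk request above: the two that match name == "Simpson".
# Each member is (firstName, color, levels.strength).
families = {
    "2": [("Homer", "brown", 30), ("Lisa", "brown", 40), ("Marge", "brown", 60)],
    "3": [("Bart", "yellow", 70), ("Snowball", "yellow", 80),
          ("Maggie", "yellow", 90), ("Gandpa", "brown", 95)],
}

def expected_sort_value(members, color="brown"):
    """Smallest strength among members of the given color.

    Assumption: an ascending sort picks the minimum value when no
    explicit "mode" is given.
    """
    return min(strength for _, c, strength in members if c == color)

expected = {doc_id: expected_sort_value(m) for doc_id, m in families.items()}
print(expected)
```

Under these assumptions the family with id=2 should sort on 30 and the family with id=3 on 95, so a value of 10 cannot come from either hit.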
# Index data
POST tree/family
{"name":"Doe","members":[{"firstName":"John","color":"brown","levels":{"strength":10}},{"firstName":"Serge","color":"brown","levels":{"strength":15}},{"firstName":"Marie","color":"brown","levels":{"strength":20}}]}
POST tree/family
{"name":"Simpson","members":[{"firstName":"Homer","color":"brown","levels":{"strength":30}},{"firstName":"Lisa","color":"brown","levels":{"strength":40}},{"firstName":"Marge","color":"brown","levels":{"strength":60}}]}
POST tree/family
{"name":"Simpson","members":[{"firstName":"Bart","color":"yellow","levels":{"strength":70}},{"firstName":"Snowball","color":"yellow","levels":{"strength":80}},{"firstName":"Maggie","color":"yellow","levels":{"strength":90}},{"firstName":"Gandpa","color":"brown","levels":{"strength":95}}]}
# Results after running the exact same query
{
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "tree",
        "_type": "family",
        "_id": "4YzqfmQBYPeBZjknedgI",
        "_score": null,
        "_source": {
          "name": "Simpson",
          "members": [
            {
              "firstName": "Homer",
              "color": "brown",
              "levels": {
                "strength": 30
              }
            },
            {
              "firstName": "Lisa",
              "color": "brown",
              "levels": {
                "strength": 40
              }
            },
            {
              "firstName": "Marge",
              "color": "brown",
              "levels": {
                "strength": 60
              }
            }
          ]
        },
        "sort": [
          30
        ]
      },
      ...
    ]
  }
}
As we can see, this time the same family (id=2 in the bulk example, an auto-generated id here) gets the correct sort value, 30.
Does anyone know what's happening here?
While investigating, the only significant difference I could find is how Elasticsearch lays the data out in Lucene segments depending on whether the Bulk API or the regular Index API is used.
With the Bulk API, calling "GET tree/_segments" on my example shows that ES puts all the documents in a single segment, whereas it seems to create one segment per document with the regular Index API.
Thank you for reading, and thanks for any suggestions that could help me work this out.
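For anyone who wants to compare the two layouts, here is a rough Python sketch that counts segments per shard copy from an already-parsed "GET tree/_segments" response. The `segments_per_shard` helper is a hypothetical name, and the two sample dictionaries are hand-written, heavily trimmed stand-ins for the real responses (one segment after bulk, three after individual indexing), not actual output from my cluster.

```python
def segments_per_shard(segments_response, index):
    """Return {shard_id: [segment count for each shard copy]} for one index."""
    shards = segments_response["indices"][index]["shards"]
    return {
        shard_id: [len(copy["segments"]) for copy in copies]
        for shard_id, copies in shards.items()
    }

# Hand-written stand-ins (assumption): only the keys the helper reads are kept.
bulk_like = {"indices": {"tree": {"shards": {"0": [
    {"segments": {"_0": {}}}
]}}}}
index_like = {"indices": {"tree": {"shards": {"0": [
    {"segments": {"_0": {}, "_1": {}, "_2": {}}}
]}}}}

print(segments_per_shard(bulk_like, "tree"))
print(segments_per_shard(index_like, "tree"))
```

If the segment layout really is the trigger, one possible check would be to index with the Index API, run "POST tree/_forcemerge?max_num_segments=1", and see whether the same query then returns the wrong sort value.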
Regards,
Julien Colin