Issue sorting nested documents indexed via Bulk

Hello everyone,
I recently came across a curious bug (I'm using Elasticsearch 6.2.3) that occurs under very particular circumstances.
The problem arises when sorting parent documents by a field of their nested documents. When the data is indexed with the Bulk API (which is my case), the resulting sort value seems to be wrong and does not belong to any of the document's own nested documents. However, when the data is indexed with the Index API, the problem doesn't occur.

I created a test case that reproduces the bug:

Using the Bulk API (triggers the bug)

# Create index
PUT tree 
{ "settings": {"number_of_shards": 1,"number_of_replicas": 0 } }

# Put mapping
PUT tree/family/_mapping
{"properties":{"name":{"type":"keyword"},"members":{"type":"nested","properties":{"firstName":{"type":"keyword"},"color":{"type":"keyword"},"levels":{"type":"nested","properties":{"strength":{"type":"integer"}}}}}}}

# Insert data (bulk index API)
POST _bulk
{ "index" : { "_index" : "tree", "_type" : "family", "_id" : "1" } }
{"name":"Doe","members":[{"firstName":"John","color":"brown","levels":{"strength":10}},{"firstName":"Serge","color":"brown","levels":{"strength":15}},{"firstName":"Marie","color":"brown","levels":{"strength":20}}]}
{ "index" : { "_index" : "tree", "_type" : "family", "_id" : "2" } }
{"name":"Simpson","members":[{"firstName":"Homer","color":"brown","levels":{"strength":30}},{"firstName":"Lisa","color":"brown","levels":{"strength":40}},{"firstName":"Marge","color":"brown","levels":{"strength":60}}]}
{ "index" : { "_index" : "tree", "_type" : "family", "_id" : "3" } }
{"name":"Simpson","members":[{"firstName":"Bart","color":"yellow","levels":{"strength":70}},{"firstName":"Snowball","color":"yellow","levels":{"strength":80}},{"firstName":"Maggie","color":"yellow","levels":{"strength":90}},{"firstName":"Gandpa","color":"brown","levels":{"strength":95}}]}

# Query
GET tree/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": {
              "value": "Simpson"
            }
          }
        },
        {
          "nested": {
            "path": "members",
            "query": {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "members.color": {
                        "value": "brown"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "members.levels.strength": {
        "order": "asc",
        "nested": {
          "path": "members",
          "filter": {
            "term": {
              "members.color": {
                "value": "brown"
              }
            }
          },
          "nested": {
            "path": "members.levels"
          }
        }
      }
    }
  ]
}

# Results
{
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "tree",
        "_type": "family",
        "_id": "2",
        "_score": null,
        "_source": {
          "name": "Simpson",
          "members": [
            {
              "firstName": "Homer",
              "color": "brown",
              "levels": {
                "strength": 30
              }
            },
            {
              "firstName": "Lisa",
              "color": "brown",
              "levels": {
                "strength": 40
              }
            },
            {
              "firstName": "Marge",
              "color": "brown",
              "levels": {
                "strength": 60
              }
            }
          ]
        },
        "sort": [
          10
        ]
      },
     ...
    ]
  }
}

As we can see here, the family with id=2 gets a sort value of 10, a value that doesn't exist in that document (it exists in another document, id=1, but that one is filtered out by the query).
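To spot this on a larger result set, the returned sort value can be compared with the value one would expect from each hit's own `_source`. The following is only a minimal sketch: the function names `expected_sort_value` and `find_suspicious_hits` are hypothetical, and it assumes an ascending sort (so the smallest matching strength is expected) and a search response already parsed into a Python dict.

```python
def expected_sort_value(source, color="brown"):
    """Smallest strength among the members of the given color, or None."""
    strengths = []
    for member in source.get("members", []):
        if member.get("color") != color:
            continue
        levels = member.get("levels", [])
        # "levels" may be a single object or a list of objects.
        if isinstance(levels, dict):
            levels = [levels]
        strengths.extend(lvl["strength"] for lvl in levels if "strength" in lvl)
    return min(strengths) if strengths else None


def find_suspicious_hits(response, color="brown"):
    """Return (id, returned, expected) for hits whose sort value differs."""
    suspicious = []
    for hit in response["hits"]["hits"]:
        returned = hit["sort"][0]
        expected = expected_sort_value(hit["_source"], color)
        if returned != expected:
            suspicious.append((hit["_id"], returned, expected))
    return suspicious


# The first hit of the bulk-indexed result shown above, trimmed.
bulk_response = {
    "hits": {
        "hits": [
            {
                "_id": "2",
                "_source": {
                    "name": "Simpson",
                    "members": [
                        {"firstName": "Homer", "color": "brown", "levels": {"strength": 30}},
                        {"firstName": "Lisa", "color": "brown", "levels": {"strength": 40}},
                        {"firstName": "Marge", "color": "brown", "levels": {"strength": 60}},
                    ],
                },
                "sort": [10],
            }
        ]
    }
}

print(find_suspicious_hits(bulk_response))  # [('2', 10, 30)]
```

With the Index API result, the same check would report nothing for this family, since the returned value (30) matches the expected one.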

Using Index API (no bug in this case)

# Index data
POST tree/family
{"name":"Doe","members":[{"firstName":"John","color":"brown","levels":{"strength":10}},{"firstName":"Serge","color":"brown","levels":{"strength":15}},{"firstName":"Marie","color":"brown","levels":{"strength":20}}]}
POST tree/family
{"name":"Simpson","members":[{"firstName":"Homer","color":"brown","levels":{"strength":30}},{"firstName":"Lisa","color":"brown","levels":{"strength":40}},{"firstName":"Marge","color":"brown","levels":{"strength":60}}]}
POST tree/family
{"name":"Simpson","members":[{"firstName":"Bart","color":"yellow","levels":{"strength":70}},{"firstName":"Snowball","color":"yellow","levels":{"strength":80}},{"firstName":"Maggie","color":"yellow","levels":{"strength":90}},{"firstName":"Gandpa","color":"brown","levels":{"strength":95}}]}

# Results after running the exact same query
{
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "tree",
        "_type": "family",
        "_id": "4YzqfmQBYPeBZjknedgI",
        "_score": null,
        "_source": {
          "name": "Simpson",
          "members": [
            {
              "firstName": "Homer",
              "color": "brown",
              "levels": {
                "strength": 30
              }
            },
            {
              "firstName": "Lisa",
              "color": "brown",
              "levels": {
                "strength": 40
              }
            },
            {
              "firstName": "Marge",
              "color": "brown",
              "levels": {
                "strength": 60
              }
            }
          ]
        },
        "sort": [
          30
        ]
      },
     ...
    ]
  }
}

As we can see this time, the same family (id=2 in the bulk example, with an auto-generated id here) gets the correct sort value, 30.

Does anyone know what's happening here?

While investigating this issue, the only significant difference I could find is how Elasticsearch lays out the data in Lucene segments, depending on whether the Bulk API or the regular Index API is used.

When I use the Bulk API on my example and call "GET tree/_segments", I can see that ES puts all documents in a single segment, whereas it seems to create one segment per document with the regular Index API.

Thank you for reading, and thanks for any suggestions that could help me work this out.
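To compare the two cases quickly, the segment count per shard can be read from the "GET tree/_segments" response. This is a rough sketch only: the helper name `count_segments` is hypothetical, and the sample dict is a hand-trimmed illustration of the response shape I am assuming (indices, then shards, then a list of shard copies each holding a "segments" object), not real output from my cluster.

```python
def count_segments(segments_response, index="tree"):
    """Map each shard id to the number of segments in its first shard copy."""
    shards = segments_response["indices"][index]["shards"]
    return {
        shard_id: len(copies[0]["segments"])
        for shard_id, copies in shards.items()
    }


# Illustrative, trimmed responses (assumed shape, made-up segment names).
after_bulk = {
    "indices": {"tree": {"shards": {"0": [{"segments": {"_0": {}}}]}}}
}
after_single_indexing = {
    "indices": {
        "tree": {"shards": {"0": [{"segments": {"_0": {}, "_1": {}, "_2": {}}}]}}
    }
}

print(count_segments(after_bulk))             # {'0': 1}
print(count_segments(after_single_indexing))  # {'0': 3}
```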

Regards,
Julien Colin


Note that indexing documents one by one works for this very small example, but fails when indexing a large amount of data. From what I can see, when indexing a larger number of documents, Elasticsearch starts merging segments together, and the same bug appears.

Hello.
If no one has any insight on this, would you advise me to open an issue on the Elasticsearch repository directly?

For information, this issue has been fixed, and the fix will be available in Elasticsearch 6.3.3.
