Bug in aggregation result when using shards

We have found a reproducible issue where an aggregation returns incorrect results when the index has multiple shards.

Reproduction (the Python script below can be used for steps 1 and 2):

  1. Create an index with 5 shards, using the mappings from this JSON: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/sales-production-mappings.json
  2. Add the following 1,500 documents to it: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/sales-production-anonymized.json
  3. Run the following query (a hedged sketch of its shape follows this list): https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/search_query.json
  4. Write down the results, in particular each combination of key and total quantity.
  5. Change "size": 20 to "size": 200 in the search query and run it again.
  6. Compare the outputs.
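
For reference, the aggregation in search_query.json presumably has roughly the following shape; this is a hedged reconstruction, not the verbatim query. Only the field catalogs.full_submodel_grouping (used in the verification step below), the size parameter, and the index name come from the post itself; the aggregation name by_grouping and the sum over a quantity field are assumptions.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Hypothetical reconstruction of search_query.json: a terms aggregation
    # on the grouping key with a sum of "quantity" per bucket, ordered by
    # that sum descending (the results below are sorted that way).
    query = {
        'size': 0,
        'aggs': {
            'by_grouping': {
                'terms': {
                    'field': 'catalogs.full_submodel_grouping',
                    'size': 20,  # step 5: change this to 200
                    'order': {'total_quantity': 'desc'},
                },
                'aggs': {
                    'total_quantity': {'sum': {'field': 'quantity'}},
                },
            }
        },
    }

    response = es.search(index='sales-reproduction', body=query)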

The outputs depend a little on which data ends up on which shard, and can hence differ (which is the actual problem!). For me, the results were:

size = 20:

    [(1, 'MDS_FES_brand_supplier_1||article_113||34-46||120', 19.0),
     (2, 'MDS_FES_brand_supplier_1||article_115||34-46||120', 19.0),
     (3, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||990', 14.0),
     (4, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||990', 13.0),
     (5, 'MDS_FES_brand_supplier_1||article_113||603', 12.0),
     (6, 'MDS_FES_brand_supplier_1||article_113||34-46||990', 10.0),
     (7, 'MDS_FES_brand_supplier_1||article_115||34-46||491', 10.0),
     (8, 'MDS_FES_brand_supplier_1||article_113||990', 8.0),
     (9, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||650', 7.0),
     (10, 'MDS_FES_brand_supplier_1||article_113||34-46||348', 6.0),
     (11, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||120', 6.0),
     (12, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||650', 6.0),
     (13, 'MDS_FES_brand_supplier_1||article_115||34-46||348', 6.0),
     (14, 'MDS_FES_brand_supplier_1||article_115||990', 6.0),
     (15, 'MDS_FES_brand_supplier_1||article_115||XS-XL||348', 6.0),
     (16, 'MDS_FES_brand_supplier_1||article_113||120', 5.0),
     (17, 'MDS_FES_brand_supplier_1||article_113||34-46||320', 5.0),
     (18, 'MDS_FES_brand_supplier_1||article_115||120', 5.0),
     (19, 'MDS_FES_brand_supplier_1||article_115||603', 5.0),
     (20, 'MDS_FES_brand_supplier_1||article_115||638', 5.0)]

size = 200:

    [(1, 'MDS_FES_brand_supplier_1||article_113||34-46||120', 19.0),
     (2, 'MDS_FES_brand_supplier_1||article_115||34-46||120', 19.0),
     (3, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||990', 14.0),
     (4, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||990', 13.0),
     (5, 'MDS_FES_brand_supplier_1||article_113||603', 12.0),
     (6, 'MDS_FES_brand_supplier_1||article_115||34-46||491', 10.0),
     (7, 'MDS_FES_brand_supplier_1||article_113||34-46||990', 9.0),
     (8, 'MDS_FES_brand_supplier_1||article_113||990', 8.0),
     (9, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||650', 7.0),
     (10, 'MDS_FES_brand_supplier_1||article_113||34-46||348', 6.0),
     (11, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||120', 6.0),
     (12, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||650', 6.0),
     (13, 'MDS_FES_brand_supplier_1||article_115||990', 6.0),
     (14, 'MDS_FES_brand_supplier_1||article_115||XS-XL||348', 6.0),
     (15, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||491', 6.0),
     (16, 'MDS_FES_brand_supplier_1||article_113||120', 5.0),
     (17, 'MDS_FES_brand_supplier_1||article_115||120', 5.0),
     (18, 'MDS_FES_brand_supplier_1||article_115||34-46||348', 5.0),
     (19, 'MDS_FES_brand_supplier_1||article_115||603', 5.0),
     (20, 'MDS_FES_brand_supplier_1||article_115||638', 5.0)]

As one can see, the bucket with key MDS_FES_brand_supplier_1||article_113||34-46||990 has a total quantity of 10 in the first case and of 9 in the second. It turns out that in the first case not all data was taken into account: one document with quantity = -1 was missed, so the sum came out one too high. One can verify this by adding a query {'term': {'catalogs.full_submodel_grouping': 'MDS_FES_brand_supplier_1||article_113||34-46||990'}} to the search query, which restricts the aggregation to that single bucket. With that filter in place, the result is 9 in both cases.
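
A minimal sketch of that verification, reusing the hypothetical query shape from above (the term filter itself is quoted verbatim from the post):

    # Restrict the search to the suspect bucket; the sum sub-aggregation is
    # then computed over all matching documents, and both sizes agree on 9.
    verification_query = dict(query)  # shallow copy of the sketch above
    verification_query['query'] = {
        'term': {
            'catalogs.full_submodel_grouping':
                'MDS_FES_brand_supplier_1||article_113||34-46||990'
        }
    }
    response = es.search(index='sales-reproduction', body=verification_query)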

I tested this on Elasticsearch 6.4.2 with two nodes. The problem disappears when the query is executed against an index with only 1 shard.

Python script for reproduction, with the index helpers written out as direct elasticsearch-py client calls:

    import json

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def restore_snapshot():
        index = 'sales-reproduction'
        # Drop any leftover index from a previous run, then recreate it
        # with 5 shards and the downloaded mappings.
        es.indices.delete(index=index, ignore=[404])
        with open('sales-production-mappings.json', 'r') as f:
            mappings = json.load(f)
        es.indices.create(index=index,
                          body={'settings': {'number_of_shards': 5},
                                'mappings': mappings})
        # The anonymized dump is a raw search response; index each hit's
        # source (assumes the mappings define the '_doc' type).
        with open('sales-production-anonymized.json', 'r') as f:
            result = json.load(f)
        for sale in result['hits']['hits']:
            es.index(index=index, doc_type='_doc', body=sale['_source'])
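
And a hedged sketch for steps 3 to 6, reusing the hypothetical query from above and printing (rank, key, total quantity) tuples in the format of the results shown earlier:

    def print_buckets(size):
        # Run the aggregation with the given terms size and print the top
        # 20 buckets for comparison; "by_grouping" and "total_quantity" are
        # the hypothetical names from the query sketch above.
        query['aggs']['by_grouping']['terms']['size'] = size
        response = es.search(index='sales-reproduction', body=query)
        buckets = response['aggregations']['by_grouping']['buckets']
        for rank, bucket in enumerate(buckets[:20], start=1):
            print((rank, bucket['key'], bucket['total_quantity']['value']))

    print_buckets(size=20)
    print_buckets(size=200)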

Some of the aggregations in Elasticsearch are approximate, and this includes the terms aggregation. This trade-off was made so that large data volumes can be aggregated quickly; the reasoning behind it is explained quite well in this post.
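
For completeness: the usual way to reduce (though not fully eliminate) this shard-level inaccuracy is to raise the terms aggregation's shard_size, so each shard returns more candidate buckets before the results are reduced on the coordinating node. A minimal sketch against the hypothetical query above:

    # Ask each shard for more candidate buckets than the final size; this
    # makes the reduced result more accurate at the cost of extra work.
    query['aggs']['by_grouping']['terms']['shard_size'] = 1000
    response = es.search(index='sales-reproduction', body=query)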
