Bug in aggregation result when using shards


(Erwin Rossen) #1

We have found a reproducible issue where an aggregation does not return the correct results when the index is sharded.

Reproduction (The Python script below can be used for steps 1 and 2):

  1. Create an index with 5 shards using the mappings from this JSON: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/sales-production-mappings.json
  2. Add the following 1,500 documents to it: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/sales-production-anonymized.json
  3. Run the following query: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/search_query.json
  4. Write down the results somewhere, in particular the combination of key and total quantity.
  5. Change "size": 20 to "size": 200 in the search query and run it again.
  6. Compare the outputs (a sketch for these steps is included with the Python scripts below).

The outputs depend a little on which data is stored on which shard, and hence can differ (which is the actual problem!). For me, the results were:

size = 20:

    [(1, 'MDS_FES_brand_supplier_1||article_113||34-46||120', 19.0),
     (2, 'MDS_FES_brand_supplier_1||article_115||34-46||120', 19.0),
     (3, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||990', 14.0),
     (4, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||990', 13.0),
     (5, 'MDS_FES_brand_supplier_1||article_113||603', 12.0),
     (6, 'MDS_FES_brand_supplier_1||article_113||34-46||990', 10.0),
     (7, 'MDS_FES_brand_supplier_1||article_115||34-46||491', 10.0),
     (8, 'MDS_FES_brand_supplier_1||article_113||990', 8.0),
     (9, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||650', 7.0),
     (10, 'MDS_FES_brand_supplier_1||article_113||34-46||348', 6.0),
     (11, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||120', 6.0),
     (12, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||650', 6.0),
     (13, 'MDS_FES_brand_supplier_1||article_115||34-46||348', 6.0),
     (14, 'MDS_FES_brand_supplier_1||article_115||990', 6.0),
     (15, 'MDS_FES_brand_supplier_1||article_115||XS-XL||348', 6.0),
     (16, 'MDS_FES_brand_supplier_1||article_113||120', 5.0),
     (17, 'MDS_FES_brand_supplier_1||article_113||34-46||320', 5.0),
     (18, 'MDS_FES_brand_supplier_1||article_115||120', 5.0),
     (19, 'MDS_FES_brand_supplier_1||article_115||603', 5.0),
     (20, 'MDS_FES_brand_supplier_1||article_115||638', 5.0)]

size = 200:

    [(1, 'MDS_FES_brand_supplier_1||article_113||34-46||120', 19.0),
     (2, 'MDS_FES_brand_supplier_1||article_115||34-46||120', 19.0),
     (3, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||990', 14.0),
     (4, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||990', 13.0),
     (5, 'MDS_FES_brand_supplier_1||article_113||603', 12.0),
     (6, 'MDS_FES_brand_supplier_1||article_115||34-46||491', 10.0),
     (7, 'MDS_FES_brand_supplier_1||article_113||34-46||990', 9.0),
     (8, 'MDS_FES_brand_supplier_1||article_113||990', 8.0),
     (9, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||650', 7.0),
     (10, 'MDS_FES_brand_supplier_1||article_113||34-46||348', 6.0),
     (11, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||120', 6.0),
     (12, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||650', 6.0),
     (13, 'MDS_FES_brand_supplier_1||article_115||990', 6.0),
     (14, 'MDS_FES_brand_supplier_1||article_115||XS-XL||348', 6.0),
     (15, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||491', 6.0),
     (16, 'MDS_FES_brand_supplier_1||article_113||120', 5.0),
     (17, 'MDS_FES_brand_supplier_1||article_115||120', 5.0),
     (18, 'MDS_FES_brand_supplier_1||article_115||34-46||348', 5.0),
     (19, 'MDS_FES_brand_supplier_1||article_115||603', 5.0),
     (20, 'MDS_FES_brand_supplier_1||article_115||638', 5.0)]

As one can see, the bucket with key MDS_FES_brand_supplier_1||article_113||34-46||990 has a total quantity of 10 in the first case and 9 in the second. It turns out that in the first case not all data was taken into account: one document with quantity = -1 was missed. One can verify this by adding the query {'term': {'catalogs.full_submodel_grouping': 'MDS_FES_brand_supplier_1||article_113||34-46||990'}} to the search query, as sketched below; with that filter in place, the result is 9 in both cases.
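
A minimal sketch of this verification step, assuming the elasticsearch Python client, a cluster on localhost:9200, and that search_query.json carries a standard top-level query:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes localhost:9200
key = 'MDS_FES_brand_supplier_1||article_113||34-46||990'

with open('search_query.json', 'r') as f:
    body = json.load(f)

# Restrict the search to a single grouping key so that every shard reports
# this term and the summed quantity becomes exact.
body['query'] = {
    'bool': {
        'must': [body.get('query', {'match_all': {}})],
        'filter': [{'term': {'catalogs.full_submodel_grouping': key}}],
    }
}

result = es.search(index='sales-reproduction', body=body)
print(json.dumps(result['aggregations'], indent=2))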

I tested this on Elasticsearch 6.4.2 with two nodes. The problem disappears when the same steps are executed against an index with only 1 shard.

Python scripts for reproduction:

import json

# delete_index, create_index and post_to_index are thin wrappers around the
# Elasticsearch REST API; their definitions are not shown here.

def restore_snapshot():
    # Steps 1 and 2: create the 5-shard index and load the 1,500 documents.
    index = 'sales-reproduction'
    delete_index(index)
    with open('sales-production-mappings.json', 'r') as f:
        mappings = json.load(f)
    create_index(index, settings={'number_of_shards': 5}, mappings=mappings)
    with open('sales-production-anonymized.json', 'r') as f:
        result = json.load(f)
    for sale in result['hits']['hits']:
        source = sale['_source']
        post_to_index(index, source)
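
For steps 3 to 6, a minimal sketch along the same lines, assuming the elasticsearch Python client on localhost:9200; the aggregation names 'grouping' and 'total_quantity' are placeholders, since search_query.json is not reproduced here:

import json
from elasticsearch import Elasticsearch

def compare_sizes():
    es = Elasticsearch()  # assumes localhost:9200
    index = 'sales-reproduction'
    with open('search_query.json', 'r') as f:
        body = json.load(f)

    for size in (20, 200):
        # Steps 3 and 5: run the query with "size": 20, then with "size": 200.
        # 'grouping' and 'total_quantity' are placeholder aggregation names;
        # adjust them to the names used in search_query.json.
        body['aggs']['grouping']['terms']['size'] = size
        buckets = es.search(index=index, body=body)['aggregations']['grouping']['buckets']
        # Steps 4 and 6: print (rank, key, total quantity) for comparison.
        print('size =', size)
        for rank, bucket in enumerate(buckets, start=1):
            print((rank, bucket['key'], bucket['total_quantity']['value']))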

(Christian Dahlqvist) #2

Some of the aggregations in Elasticsearch are approximate, and this includes the terms aggregation. This trade-off has been made in order to aggregate large data volumes quickly, and the reasoning behind it is explained quite well in this post.
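
For context, a terms aggregation asks each shard for only its own top shard_size terms (which defaults to a value slightly larger than size) and then merges those shard-local lists, so a term that narrowly misses the cut on one shard loses that shard's documents from sub-aggregations such as the quantity sum. This matches the behaviour described above. A minimal sketch of the usual mitigations, reusing the placeholder aggregation and field names from the sketches above: raise shard_size so each shard returns more candidate terms, and enable show_term_doc_count_error to make the worst-case error visible per bucket.

aggs = {
    'grouping': {
        'terms': {
            'field': 'catalogs.full_submodel_grouping',  # placeholder field name
            'size': 20,
            # Ask each shard for more candidate terms than the final size,
            # so near-miss terms are less likely to be dropped on a shard.
            'shard_size': 500,
            # Adds doc_count_error_upper_bound to each bucket in the response.
            'show_term_doc_count_error': True,
        },
        'aggs': {
            'total_quantity': {'sum': {'field': 'quantity'}},  # placeholder field name
        },
    }
}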