We have found a reproducible issue where an aggregation returns incorrect results when sharding is used.
Reproduction (The Python script below can be used for steps 1 and 2):
- Create an index with 5 shards using the mappings from this JSON: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/sales-production-mappings.json
- Add the following 1,500 documents to it: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/sales-production-anonymized.json
- Run the following query: https://s3.eu-central-1.amazonaws.com/nyon-shared-storage/search_query.json
- Write down the results somewhere, in particular the combination of key and total quantity.
- Change the "size": 20 to "size": 200 in the search query and run it again.
- Compare the outputs.
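For the last step, a small helper can diff the two result lists automatically. This is a sketch that assumes each result list holds (rank, key, total_quantity) tuples, as in the output shown below:

```python
def diff_buckets(results_a, results_b):
    """Return the keys whose total quantity differs between two runs,
    mapped to their (run_a, run_b) totals."""
    totals_a = {key: qty for _, key, qty in results_a}
    totals_b = {key: qty for _, key, qty in results_b}
    return {
        key: (totals_a.get(key), totals_b.get(key))
        for key in totals_a.keys() | totals_b.keys()
        if totals_a.get(key) != totals_b.get(key)
    }

# Illustrative data only: one bucket differs between the two runs.
size_20 = [(1, 'a||1', 10.0), (2, 'a||2', 8.0)]
size_200 = [(1, 'a||1', 9.0), (2, 'a||2', 8.0)]
print(diff_buckets(size_20, size_200))  # {'a||1': (10.0, 9.0)}
```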
The outputs depend on which documents end up on which shard, and can therefore differ between runs (which is the actual problem!). For me, the results were:
size = 20:
[(1, 'MDS_FES_brand_supplier_1||article_113||34-46||120', 19.0),
(2, 'MDS_FES_brand_supplier_1||article_115||34-46||120', 19.0),
(3, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||990', 14.0),
(4, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||990', 13.0),
(5, 'MDS_FES_brand_supplier_1||article_113||603', 12.0),
(6, 'MDS_FES_brand_supplier_1||article_113||34-46||990', 10.0),
(7, 'MDS_FES_brand_supplier_1||article_115||34-46||491', 10.0),
(8, 'MDS_FES_brand_supplier_1||article_113||990', 8.0),
(9, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||650', 7.0),
(10, 'MDS_FES_brand_supplier_1||article_113||34-46||348', 6.0),
(11, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||120', 6.0),
(12, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||650', 6.0),
(13, 'MDS_FES_brand_supplier_1||article_115||34-46||348', 6.0),
(14, 'MDS_FES_brand_supplier_1||article_115||990', 6.0),
(15, 'MDS_FES_brand_supplier_1||article_115||XS-XL||348', 6.0),
(16, 'MDS_FES_brand_supplier_1||article_113||120', 5.0),
(17, 'MDS_FES_brand_supplier_1||article_113||34-46||320', 5.0),
(18, 'MDS_FES_brand_supplier_1||article_115||120', 5.0),
(19, 'MDS_FES_brand_supplier_1||article_115||603', 5.0),
(20, 'MDS_FES_brand_supplier_1||article_115||638', 5.0)]
size = 200:
[(1, 'MDS_FES_brand_supplier_1||article_113||34-46||120', 19.0),
(2, 'MDS_FES_brand_supplier_1||article_115||34-46||120', 19.0),
(3, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||990', 14.0),
(4, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||990', 13.0),
(5, 'MDS_FES_brand_supplier_1||article_113||603', 12.0),
(6, 'MDS_FES_brand_supplier_1||article_115||34-46||491', 10.0),
(7, 'MDS_FES_brand_supplier_1||article_113||34-46||990', 9.0),
(8, 'MDS_FES_brand_supplier_1||article_113||990', 8.0),
(9, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||650', 7.0),
(10, 'MDS_FES_brand_supplier_1||article_113||34-46||348', 6.0),
(11, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||120', 6.0),
(12, 'MDS_FES_brand_supplier_1||article_113||XS-XXL||650', 6.0),
(13, 'MDS_FES_brand_supplier_1||article_115||990', 6.0),
(14, 'MDS_FES_brand_supplier_1||article_115||XS-XL||348', 6.0),
(15, 'MDS_FES_brand_supplier_1||article_115||XS-XXL||491', 6.0),
(16, 'MDS_FES_brand_supplier_1||article_113||120', 5.0),
(17, 'MDS_FES_brand_supplier_1||article_115||120', 5.0),
(18, 'MDS_FES_brand_supplier_1||article_115||34-46||348', 5.0),
(19, 'MDS_FES_brand_supplier_1||article_115||603', 5.0),
(20, 'MDS_FES_brand_supplier_1||article_115||638', 5.0)]
As one can see, the bucket with key MDS_FES_brand_supplier_1||article_113||34-46||990
has a total quantity of 10 in the first case and 9 in the second. It turns out that in the first case not all data was taken into account: one document with quantity = -1 was missed. One can verify this by adding the filter {'term': {'catalogs.full_submodel_grouping': 'MDS_FES_brand_supplier_1||article_113||34-46||990'}}
to the search query; with that filter, the result is 9 in both cases.
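The verification step can be scripted. The sketch below assumes the original request body has a top-level 'query' clause (the exact structure of search_query.json is an assumption); it wraps that clause in a bool query with the extra term filter, leaving the rest of the body untouched:

```python
def add_term_filter(query_body, field, value):
    """Return a copy of the request body with an extra term filter
    wrapped around the existing query clause."""
    original = query_body.get('query', {'match_all': {}})
    filtered = {
        'bool': {
            'must': [original],
            'filter': [{'term': {field: value}}],
        }
    }
    return {**query_body, 'query': filtered}

# Hypothetical body standing in for the real search_query.json.
body = {'size': 20, 'query': {'match_all': {}}}
verify = add_term_filter(
    body,
    'catalogs.full_submodel_grouping',
    'MDS_FES_brand_supplier_1||article_113||34-46||990',
)
```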
I tested this on Elasticsearch 6.4.2 with two nodes. The problem disappears when the query is executed against an index with only 1 shard.
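This behaviour is consistent with how the terms aggregation trims results per shard: each shard returns only its own top shard_size terms (derived from size by default), so the sub-aggregation total for a bucket can miss documents on a shard where that term did not make the per-shard cut. Raising shard_size explicitly is the usual workaround. Below is a hypothetical aggregation body illustrating this; the field and aggregation names are assumptions, not necessarily those used in search_query.json:

```python
# Sketch of a terms aggregation with an explicit shard_size.
agg_body = {
    'size': 0,
    'aggs': {
        'groupings': {
            'terms': {
                'field': 'catalogs.full_submodel_grouping',
                'size': 20,
                # Ask each shard for far more terms than the final size,
                # so per-shard trimming is less likely to drop documents
                # that contribute to a returned bucket.
                'shard_size': 1000,
            },
            'aggs': {
                'total_quantity': {'sum': {'field': 'quantity'}},
            },
        }
    }
}
```

With 1,500 documents, a shard_size of 1000 per shard effectively disables trimming, which is why the single-shard index (where no cross-shard merge happens) also gives exact totals.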
Python script for reproduction (the helper functions delete_index, create_index and post_to_index are our own wrappers around the Elasticsearch REST API and are omitted here):

import json

def restore_snapshot():
    index = 'sales-reproduction'
    delete_index(index)
    with open('sales-production-mappings.json', 'r') as f:
        mappings = json.loads(f.read())
    create_index(index, settings={'number_of_shards': 5}, mappings=mappings)
    with open('sales-production-anonymized.json', 'r') as f:
        result = json.loads(f.read())
    # The anonymized dump is itself a search response; index each hit's _source.
    for sale in result['hits']['hits']:
        source = sale['_source']
        post_to_index(index, source)