Negative number aggregation returns wrong results

Negative number aggregation is returning the wrong results. Is there any indexing or search property that needs to be applied for this?

Here is the sample data.

TOT_TRAN	ACID
2000	0G50044716
-100	0G50044716
118839	1G60044718
-10000	0G50044716
-80000	0G50044716
214350	0G50044717
-199912	0G50044716
57770	0G50044717
-10	0G50044716
-255	0G50044716
-15000	0G50044716
-61564	0G50044716
62094	0G50044716
-10000	0G50044718
-33165	0G50044716
20000	0G50044717
-29881	0G50044718
100000	0G50044716
-12500	0G50044716
-38261	0G50044716
-10200	0G50044716
37725	0G50044716
-3500	0G50044716

Here is the sample query:

{
	"from": 0,
	"size": 0,
	"_source": {
		"includes": [
			"SUM"
		],
		"excludes": []
	},
	"aggregations": {
		"bank_ACID": {
			"terms": {
				"field": "ACID",
				"size": 10,
				"order": {
					"sumValue": "asc"
				}
			},
			"aggregations": {
				"sumValue": {
					"sum": {
						"field": "TOT_TRAN"
					}
				}
			}
		}
	}
}

If I set the aggregation size to 10, it returns completely wrong results; if I increase the aggregation size to 1000 or so, it returns correct results for a limited number of records.

Checked versions: Elasticsearch 2.2 and 2.3.
Is there any way I can get a solution for this?

Thank you in advance.

This is due to some design limitations of the terms aggregation; see this section of the documentation for more information: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
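
As a quick first check (a sketch; substitute your actual index name for "your_index"), you can confirm how many shards the index has, since the approximation described there only comes into play when the data is spread across more than one shard:

GET your_index/_settings

GET _cat/shards/your_index?v

If number_of_shards is greater than 1, each shard computes its own top terms locally and the coordinating node merges those lists, so a term can be missed even though its overall total would have placed it in the top 10.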

Thank you so much for the response.

Yes, I understood your point. I am using a single-node, single-shard configuration, yet my results are still inaccurate for the field that contains the negative values.

Is there any way I can get accurate results?

Hmmm, ok, the terms aggregation should be accurate if you only have a single shard. Could you share the JSON response of the size 10 request, and also what you would expect the values to be?

I will also try to reproduce what you are seeing with the sample data you have given.

Sure. I have 34,752 records in total in my index. Here is the JSON for my top 10 results.

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 34752,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "bank_ACID": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 31867,
      "buckets": [
        {
          "key": "OD27437159",
          "doc_count": 181,
          "sumValue": {
            "value": -13076132.889999999
          }
        },
        {
          "key": "OD40711588",
          "doc_count": 157,
          "sumValue": {
            "value": -12668692.59
          }
        },
        {
          "key": "OD44733957",
          "doc_count": 118,
          "sumValue": {
            "value": -8458031
          }
        },
        {
          "key": "OD33506481",
          "doc_count": 394,
          "sumValue": {
            "value": -7281564.550000005
          }
        },
        {
          "key": "OD34764172",
          "doc_count": 303,
          "sumValue": {
            "value": -6543033.07
          }
        },
        {
          "key": "OD25860349",
          "doc_count": 186,
          "sumValue": {
            "value": -5707261.74
          }
        },
        {
          "key": "OD20819040",
          "doc_count": 840,
          "sumValue": {
            "value": -5520830.190000002
          }
        },
        {
          "key": "OD15440785",
          "doc_count": 524,
          "sumValue": {
            "value": -5131579.209999999
          }
        },
        {
          "key": "OD43887242",
          "doc_count": 149,
          "sumValue": {
            "value": -4408034.67
          }
        },
        {
          "key": "OD21222988",
          "doc_count": 33,
          "sumValue": {
            "value": -3535992
          }
        }
      ]
    }
  }
}

The actual expected results are completely different items. I can share the Excel sample data with you if it is required to reproduce the issue.

Please let me know.

I haven't been able to reproduce the issue with the sample data above. If you can share your full dataset (you can private message me if you don't want to share it publicly), I'll try to reproduce the issue with that data.

Sure, I will drop you a private response with the sample Excel data.

Are you explicitly defining your mappings, or are you letting Elasticsearch dynamically create them? If you are letting Elasticsearch dynamically create the mapping, then I think I know what the issue is. The first document indexed will look like an integer value, so the field tot_tran will be mapped to a long. Because of this, some precision will be lost for subsequent floating point values, which will lead to incorrect results.
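
To illustrate what I mean with a hypothetical index (not your actual one), if the first document indexed contains a whole number, dynamic mapping picks long for the field and later fractional values lose their decimal part:

PUT test_dynamic/doc/1
{
  "ACID": "0G50044716",
  "TOT_TRAN": 2000
}

GET test_dynamic/_mapping

PUT test_dynamic/doc/2
{
  "ACID": "0G50044716",
  "TOT_TRAN": -100.25
}

The mapping call should show TOT_TRAN as long, and -100.25 is then indexed as -100 (the _source still shows -100.25), so aggregations over that field come out slightly wrong.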

That being said, even with the index created with the mappings below I still get incorrect sorting, as buckets which have a negative sum of tot_tran do not appear in the top N. I will keep looking at this.

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "acid": {
          "type": "keyword"
        },
        "tot_tran": {
          "type": "double",
          "ignore_malformed": false,
          "coerce": false
          
        }
      }
    }
  }
}

Actually, I had an error where I wasn't importing all of the sample data. If I run with an index created using the mappings above, I get the same 10 terms returned as I would expect from the Excel data.

Oh, is it! Actually, I had created the index with manual mappings, not the ES automatic mappings. Here are my mappings.

{
  "index": {
    "mappings": {
      "type": {
        "_all": {
          "enabled": false
        },
        "properties": {
          
          "NAME": {
            "type": "string",
            "index": "not_analyzed",
            "fielddata": {
              "format": "doc_values",
              "loading": "eager_global_ordinals"
            },
            "fields": {
              "hash": {
                "type": "murmur3",
                "precision_step": 2147483647
              }
            },
            "null_value": "null",
            "ignore_above": 10922
          },
          "ACID": {
            "type": "string",
            "index": "not_analyzed",
            "fielddata": {
              "format": "doc_values",
              "loading": "eager_global_ordinals"
            },
            "fields": {
              "hash": {
                "type": "murmur3",
                "precision_step": 2147483647
              }
            },
            "null_value": "null",
            "ignore_above": 10922
          },
          "TOT_TRAN": {
            "type": "double"
          }
        }
      }
    }
  }
}

Is there anything I need to take care of in the mappings?

I am still facing the same issue with the negative numbers. Are you getting the correct sum aggregation results for the ACID grouping?

Yes, using the data you provided me I get the correct results. Maybe try inserting the documents into a test index created like my test below and see if the issue reproduces:

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "acid": {
          "type": "keyword"
        },
        "tot_tran": {
          "type": "double",
          "ignore_malformed": false,
          "coerce": false
          
        }
      }
    }
  }
}

I am using ES 2.2, which doesn't have the "keyword" type. I created the mapping with string and have the same issue. Is that the problem?

I checked in version 5.4.2 and it works fine, giving the proper results. But in the older 2.x versions it does not give the correct results. Is there any way I can make it work in 2.x?

Could you try creating the index with this in 2.2 instead?

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "acid": {
          "type": "string",
          "index": "not_analyzed"
        },
        "tot_tran": {
          "type": "double",
          "ignore_malformed": false,
          "coerce": false
          
        }
      }
    }
  }
}

Yes, I tried the same thing. However, this is my SQL query:
select sum(TOT_TRAN) as sumValue from index group by ACID order by sumValue asc limit 10

I think you might have tried it without the order by and limit. That might be why it returned the correct result.

The same issue is there in all the versions. However, if I set "shard_size": 2000 it gives the expected results. How will this property impact performance?
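
For clarity, this is roughly how I added it (a sketch of the same aggregation as above, with shard_size set on the terms block):

{
  "size": 0,
  "aggregations": {
    "bank_ACID": {
      "terms": {
        "field": "ACID",
        "size": 10,
        "shard_size": 2000,
        "order": {
          "sumValue": "asc"
        }
      },
      "aggregations": {
        "sumValue": {
          "sum": {
            "field": "TOT_TRAN"
          }
        }
      }
    }
  }
}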

I have just noticed from your response JSON above ("_shards": { "total": 5 }) that your index is actually using 5 shards.

Previously you had said you were using a single shard, so I had based all my answers on that. What you are seeing is a result of what is described in the Terms Aggregation documentation linked above.

Increasing the shard_size will indeed improve the accuracy of the results, but at the cost of streaming more data from the shards to the coordinating node, which will affect performance in both network IO and memory.

Given that you only have ~35,000 documents in your index, it would be a good idea to use a single-shard index; that way you won't need to worry about shard_size at all.

Thank you so much. It was nice talking to you; I got a really good understanding and useful information.

Sorry to trouble you again; I need a quick clarification on identifying shards. Is there any way to identify the number of shards by looking at the physical folders or files, or can the number of shards only be determined from a logical perspective?