Inaccuracy for Geo Centroid Aggregation and Geo Bounds Aggregation

gemo1011 · April 8, 2019, 12:43pm

Hello,

While testing some geo functions for Elasticsearch, I discovered an inaccuracy in the Geo Centroid Aggregation and the Geo Bounds Aggregation.

For testing purposes I'm using the geonames dataset, a collection of around 11.8 million geopoints. I used the following query to compute the centroid of this dataset:

{
    "aggs" : {
        "centroid" : {
            "geo_centroid" : {
                "field" : "location" 
            }
        }
    }
}

I also checked for the centroid of the dataset with PostGIS and the following query:

SELECT avg(ST_X(the_geom)) as lon, avg(ST_Y(the_geom)) as lat FROM geonames

I compared the two centroids and saw a difference of around 0.2487° in longitude and around 0.1242° in latitude. Calculating the distance between these two centroids with the haversine formula, I got a distance of nearly 28 kilometers.
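For anyone who wants to repeat the distance check, here is a minimal sketch of the haversine formula in Python. The function name, the Earth radius constant, and the sample coordinates are assumptions for illustration; the actual centroid coordinates are not given in this post, so the example only checks the function against a known value (one degree of longitude at the equator).

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Sanity check: one degree of longitude along the equator.
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
print(f"{one_degree_km:.3f} km")
```

To compare two centroids, pass the Elasticsearch result as the first pair and the PostGIS result as the second pair.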

So I tested the Centroid Aggregation with some excerpts of the geonames dataset. Here is the output for 6 points, where you can see the inaccuracy of the centroid (compare it with the average of the lat and lon values):

{
  "took" : 52,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "0LM0LmkBrQo0YN4q8mim",
        "_score" : 1.0,
        "_source" : {
          "id" : "3205376",
          "name" : "Metohija",
          "location" : {
            "lat" : "42.84111",
            "lon" : "17.63361"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "7LM0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286751",
          "name" : "Ledinići",
          "location" : {
            "lat" : "42.85083",
            "lon" : "17.62"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "-LM0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286785",
          "name" : "Boljenovići",
          "location" : {
            "lat" : "42.84806",
            "lon" : "17.62472"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "-bM0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286786",
          "name" : "Gornje Selo",
          "location" : {
            "lat" : "42.84611",
            "lon" : "17.63194"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "_7M0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286800",
          "name" : "Bojnoge",
          "location" : {
            "lat" : "42.84472",
            "lon" : "17.63389"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "wrM0LmkBrQo0YN4q8pyp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3288000",
          "name" : "Ponikve",
          "location" : {
            "lat" : "42.84472",
            "lon" : "17.61306"
          }
        }
      }
    ]
  },
  "aggregations" : {
    "centroid" : {
      "location" : {
        "lat" : 42.84592493902892,
        "lon" : 17.62620317749679
      },
      "count" : 6
    }
  }
}
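The six-point response above can be checked by hand. The following sketch, with coordinates and aggregation values copied from that response, computes the plain arithmetic mean and the difference from the returned centroid. The variable names are only illustrative.

```python
import math

# (lat, lon) pairs copied from the six hits in the response.
points = [
    (42.84111, 17.63361),
    (42.85083, 17.62),
    (42.84806, 17.62472),
    (42.84611, 17.63194),
    (42.84472, 17.63389),
    (42.84472, 17.61306),
]

# Centroid returned by the geo_centroid aggregation.
agg_lat, agg_lon = 42.84592493902892, 17.62620317749679

# fsum avoids adding floating-point summation error of our own.
mean_lat = math.fsum(lat for lat, _ in points) / len(points)
mean_lon = math.fsum(lon for _, lon in points) / len(points)

lat_diff = mean_lat - agg_lat
lon_diff = mean_lon - agg_lon
print(f"mean: ({mean_lat:.8f}, {mean_lon:.8f})")
print(f"diff: lat {lat_diff:.2e}, lon {lon_diff:.2e}")
```

For these six points the difference is on the order of 1e-7 degrees in both coordinates, so the deviation is already visible in a tiny sample, although far smaller than on the full dataset.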

Here is a plot visualizing the inaccuracy I got with my test queries:
[Image: plot_discuss_elastic]

I also see this kind of inaccuracy when testing the Geo Bounds Aggregation (the coordinates of top_left and bottom_right are inaccurate). Basically, these aggregations use an average (centroid) and min/max (bounds), so where is this coming from? I looked at the source code a bit and saw something about encoding and decoding the coordinates when calculating the aggregations. Maybe that is the cause?

Ignacio_Vera · April 9, 2019, 7:58am

The effect you are describing is due to the way Elasticsearch (more precisely, Lucene) indexes points. Instead of indexing the points as doubles, it encodes the latitude and longitude as integers. This strategy reduces the size of the index, but it adds a small inaccuracy of around 1e-7 degrees (around 1 cm on the surface of the earth).
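To make the size of that error concrete, here is a rough Python sketch of integer quantization. It assumes each coordinate is mapped onto a 32-bit grid over its full range (180° for latitude, 360° for longitude) and rounded down; the function names and constants are illustrative and are not Lucene's actual API.

```python
import math

# Assumed grid spacing: 32 bits spread over the full range of each axis.
LAT_STEP = 180.0 / 2**32
LON_STEP = 360.0 / 2**32
METERS_PER_DEGREE = 111_195.0  # approximate length of one degree of latitude


def encode(value, step):
    """Map a coordinate in degrees to an integer grid index (rounding down)."""
    return math.floor(value / step)


def decode(encoded, step):
    """Map an integer grid index back to degrees."""
    return encoded * step


lat, lon = 42.84111, 17.63361  # first point from the thread
lat_err = lat - decode(encode(lat, LAT_STEP), LAT_STEP)
lon_err = lon - decode(encode(lon, LON_STEP), LON_STEP)

print(f"lon step: {LON_STEP:.3e} deg = {LON_STEP * METERS_PER_DEGREE * 100:.2f} cm")
print(f"round-trip error: lat {lat_err:.2e}, lon {lon_err:.2e}")
```

Under this assumption, a single encode/decode round trip never loses more than one grid step, which is below 1e-7 degrees and just under a centimeter.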

This is the inaccuracy I expect you are seeing in the geo bounds aggregation. For the centroid, the inaccuracy is magnified because, when calculating the moving average of the latitude and longitude, the logic seems to encode and decode the partial results.
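A toy simulation shows why re-encoding partial results matters. This sketch assumes the running mean is snapped down to a 360/2^32 degree grid after every update; that is a simplified model of the behavior described above, not the actual Elasticsearch code, and the data is random with a fixed seed.

```python
import math
import random

LON_STEP = 360.0 / 2**32  # assumed grid spacing in degrees


def quantize(value, step=LON_STEP):
    """Snap a value down to the grid, like an encode/decode round trip."""
    return math.floor(value / step) * step


def running_mean(values, requantize):
    """Incremental mean; optionally re-quantize the partial result each step."""
    mean = 0.0
    for k, x in enumerate(values, start=1):
        mean += (x - mean) / k
        if requantize:
            mean = quantize(mean)
    return mean


rng = random.Random(0)
n = 100_000
lons = [quantize(rng.uniform(17.0, 18.0)) for _ in range(n)]

exact = math.fsum(lons) / n
plain = running_mean(lons, requantize=False)
lossy = running_mean(lons, requantize=True)

drift = exact - lossy
predicted = LON_STEP * (n + 1) / 4  # rough estimate for this model
print(f"plain running mean error: {abs(plain - exact):.1e}")
print(f"re-quantized drift: {drift:.3e} (estimate {predicted:.3e})")
```

In this model every update loses about half a grid step, and that loss is weighted by the number of points seen so far, so the drift grows roughly as step × n / 4 instead of staying near one step. For about 11.8 million points that estimate is on the order of 0.25° for longitude and half that for latitude, the same magnitude as the differences reported in the first post.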

Ignacio_Vera · April 9, 2019, 8:55pm

FYI: https://github.com/elastic/elasticsearch/issues/41032

gemo1011 · April 10, 2019, 7:30am

Nice, thank you !

Sprungwunder · April 25, 2019, 1:09pm

edit:
I can't reproduce this on our production stack, so maybe this is a side effect of our development Elasticsearch "virtual" cluster, which runs 3 ES instances on the same VM.
The development data restored on the production cluster gives the same result for each replayed query, so everything is fine and as expected.
/edit

This may be a related problem:
What I see is that the results seem to alternate between two values for exactly the same query.
If I replay the same query in Kibana, I consistently get two different results that differ by up to 1e-7. That means:

Query A -> Result A
Query A -> Result A'
Query A -> Result A
Query A -> Result A'
and so on.

If I mix two queries, I get the same results:

Query A -> Result A
Query B -> Result B
Query A -> Result A
Query B -> Result B
Query C -> Result C
Query A -> Result A'
Query B -> Result B'
Query B -> Result B

That means: if I replay an even number of queries, I get constant results; if I replay an odd number, I get alternating results.

Does the mentioned fix also fix this behaviour?

POST geo_data/_search?size=0
{
  "aggs": {
    "location": {
      "geo_centroid": {
            "field": "position"
          }
    }
  }
}

The response is either

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 19434896,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "location" : {
      "location" : {
        "lat" : 1.2363329981435465,
        "lon" : 103.64341515934639
      },
      "count" : 64802
    }
  }
}

or

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 19434896,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "location": {
            "location": {
                "lat": 1.2363330319779686,
                "lon": 103.64341576836598
            },
            "count": 64802
        }
    }
}

gemo1011 · April 26, 2019, 8:25am

For me the output stays the same no matter how I execute the query.

Sprungwunder · April 26, 2019, 10:04am

Thanks for your tests. I can now confirm that the alternating output is caused by the virtual cluster setup we use in development. We changed it to a single-node cluster, and now the output stays the same.

system · May 24, 2019, 10:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
