Hello,
While testing some geo functions in Elasticsearch, I discovered an inaccuracy in the Geo Centroid Aggregation and the Geo Bounds Aggregation.
For testing I'm using the geonames dataset, a collection of around 11.8 million geopoints. I used the following query to compute the centroid of this dataset:
{
  "aggs" : {
    "centroid" : {
      "geo_centroid" : {
        "field" : "location"
      }
    }
  }
}
I also computed the centroid of the dataset with PostGIS, using the following query:
SELECT avg(ST_X(the_geom)) as lon, avg(ST_Y(the_geom)) as lat FROM geonames
I compared the two centroids and found a difference of around 0.2487° in longitude and around 0.1242° in latitude. Calculating the distance between these two centroids with the haversine formula, I got nearly 28 kilometers.
So I tested the Centroid Aggregation with some excerpts of the geonames dataset. Here is the output for 6 points, where you can see the inaccuracy of the centroid when compared with the plain average of the lat and lon values:
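For anyone who wants to repeat the distance check, here is a minimal sketch of the haversine formula in Python. The function name `haversine_km` and the Earth radius constant are my own choices for illustration, not something taken from Elasticsearch or PostGIS, and the two centroid coordinates themselves have to be filled in from your own query results.

```python
import math

# Mean Earth radius in kilometers (an assumed constant; other values are in use).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Sanity check: one degree of latitude is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

Calling `haversine_km(es_lat, es_lon, pg_lat, pg_lon)` with the Elasticsearch and PostGIS centroids should give the distance between them in kilometers.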
{
"took" : 52,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 1.0,
"hits" : [
{
"_index" : "geonames",
"_type" : "doc",
"_id" : "0LM0LmkBrQo0YN4q8mim",
"_score" : 1.0,
"_source" : {
"id" : "3205376",
"name" : "Metohija",
"location" : {
"lat" : "42.84111",
"lon" : "17.63361"
}
}
},
{
"_index" : "geonames",
"_type" : "doc",
"_id" : "7LM0LmkBrQo0YN4q8pmp",
"_score" : 1.0,
"_source" : {
"id" : "3286751",
"name" : "Ledinići",
"location" : {
"lat" : "42.85083",
"lon" : "17.62"
}
}
},
{
"_index" : "geonames",
"_type" : "doc",
"_id" : "-LM0LmkBrQo0YN4q8pmp",
"_score" : 1.0,
"_source" : {
"id" : "3286785",
"name" : "Boljenovići",
"location" : {
"lat" : "42.84806",
"lon" : "17.62472"
}
}
},
{
"_index" : "geonames",
"_type" : "doc",
"_id" : "-bM0LmkBrQo0YN4q8pmp",
"_score" : 1.0,
"_source" : {
"id" : "3286786",
"name" : "Gornje Selo",
"location" : {
"lat" : "42.84611",
"lon" : "17.63194"
}
}
},
{
"_index" : "geonames",
"_type" : "doc",
"_id" : "_7M0LmkBrQo0YN4q8pmp",
"_score" : 1.0,
"_source" : {
"id" : "3286800",
"name" : "Bojnoge",
"location" : {
"lat" : "42.84472",
"lon" : "17.63389"
}
}
},
{
"_index" : "geonames",
"_type" : "doc",
"_id" : "wrM0LmkBrQo0YN4q8pyp",
"_score" : 1.0,
"_source" : {
"id" : "3288000",
"name" : "Ponikve",
"location" : {
"lat" : "42.84472",
"lon" : "17.61306"
}
}
}
]
},
"aggregations" : {
"centroid" : {
"location" : {
"lat" : 42.84592493902892,
"lon" : 17.62620317749679
},
"count" : 6
}
}
}
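To make the comparison reproducible, here is a small Python sketch that averages the six lat/lon pairs from the hits above and prints the difference to the centroid returned by the aggregation. The helper name `mean_centroid` is just for illustration; the coordinates are copied from the response.

```python
import math

# (lat, lon) pairs copied from the six hits in the response above.
points = [
    (42.84111, 17.63361),  # Metohija
    (42.85083, 17.62),     # Ledinići
    (42.84806, 17.62472),  # Boljenovići
    (42.84611, 17.63194),  # Gornje Selo
    (42.84472, 17.63389),  # Bojnoge
    (42.84472, 17.61306),  # Ponikve
]

# Centroid reported by the geo_centroid aggregation.
es_lat, es_lon = 42.84592493902892, 17.62620317749679


def mean_centroid(pts):
    """Arithmetic mean of latitudes and longitudes (same idea as the PostGIS avg query)."""
    n = len(pts)
    return (math.fsum(p[0] for p in pts) / n,
            math.fsum(p[1] for p in pts) / n)


lat, lon = mean_centroid(points)
print(f"mean centroid: lat={lat:.8f} lon={lon:.8f}")
print(f"ES minus mean: dlat={es_lat - lat:.2e} dlon={es_lon - lon:.2e}")
```

For these six points the two results differ only on the order of 1e-7 degrees, which is far smaller than the offset I see on the full dataset.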
Here is a plot visualizing the inaccuracy I got with my test queries:
I also got this kind of inaccuracy when testing the Geo Bounds Aggregation (the coordinates of top_left and bottom_right are inaccurate). Basically, these aggregations use an average (centroid) and min/max (bounds). So where is this coming from? I looked a bit at the source code and saw something about encoding and decoding the coordinates when calculating the aggregations. Maybe that is the cause?
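To illustrate what I mean by encoding and decoding, here is a rough Python sketch of how storing a coordinate as a fixed-width integer would lose precision. This is not the actual Elasticsearch or Lucene code: the 32-bit width, the rounding with floor, and the names `BITS`, `LAT_STEP`, `LON_STEP` and `quantize` are all assumptions made for illustration.

```python
import math

BITS = 32                        # assumed number of bits per coordinate
LAT_STEP = 180.0 / 2 ** BITS     # smallest representable latitude step, in degrees
LON_STEP = 360.0 / 2 ** BITS     # smallest representable longitude step, in degrees


def quantize(value, step):
    """Encode to an integer grid cell and decode again (snaps down to a multiple of step)."""
    return math.floor(value / step) * step


lat, lon = 42.84111, 17.63361    # first point from the six-point example
q_lat = quantize(lat, LAT_STEP)
q_lon = quantize(lon, LON_STEP)

print(f"lat step {LAT_STEP:.2e}, error {lat - q_lat:.2e}")
print(f"lon step {LON_STEP:.2e}, error {lon - q_lon:.2e}")
```

Under these assumptions each stored coordinate could be off by up to about 4e-8 degrees in latitude and 8e-8 degrees in longitude. That is the same order of magnitude as the deviation in the six-point example, but it would not obviously account for an offset of around a quarter of a degree on the full dataset.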