Inaccuracy for Geo Centroid Aggregation and Geo Bounds Aggregation

Hello,

While testing some geo functions in Elasticsearch, I discovered an inaccuracy in the Geo Centroid Aggregation and the Geo Bounds Aggregation.

For testing I'm using the geonames dataset, a collection of around 11.8 million geopoints. I used the following query to compute the centroid of this dataset:

{
    "aggs" : {
        "centroid" : {
            "geo_centroid" : {
                "field" : "location" 
            }
        }
    }
}

I also checked for the centroid of the dataset with PostGIS and the following query:

SELECT avg(ST_X(the_geom)) as lon, avg(ST_Y(the_geom)) as lat FROM geonames

I compared the two centroids and saw a difference of around 0.2487° in longitude and around 0.1242° in latitude. Calculating the distance between these two centroids with the haversine formula, I got a distance of nearly 28 kilometers.
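For anyone who wants to repeat the distance check, here is a minimal haversine sketch. The function name `haversine_km` and the mean Earth radius constant are assumptions made for illustration, and the coordinates in the usage line are hypothetical because the two centroid positions themselves are not given in this post.

```python
import math

# Mean Earth radius in kilometers (assumed constant, not taken from the post).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical usage: two points one degree of latitude apart on the equator.
print(haversine_km(0.0, 0.0, 1.0, 0.0))
```

Note that the same longitude offset corresponds to a shorter distance at higher latitudes, so the resulting distance depends on where the centroids lie, not only on the two offsets.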

So I tested the Geo Centroid Aggregation with some excerpts of the geonames dataset. Here is the output for 6 points, where you can see the inaccuracy of the centroid (compared with the plain average of lat and lon):

{
  "took" : 52,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "0LM0LmkBrQo0YN4q8mim",
        "_score" : 1.0,
        "_source" : {
          "id" : "3205376",
          "name" : "Metohija",
          "location" : {
            "lat" : "42.84111",
            "lon" : "17.63361"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "7LM0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286751",
          "name" : "Ledinići",
          "location" : {
            "lat" : "42.85083",
            "lon" : "17.62"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "-LM0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286785",
          "name" : "Boljenovići",
          "location" : {
            "lat" : "42.84806",
            "lon" : "17.62472"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "-bM0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286786",
          "name" : "Gornje Selo",
          "location" : {
            "lat" : "42.84611",
            "lon" : "17.63194"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "_7M0LmkBrQo0YN4q8pmp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3286800",
          "name" : "Bojnoge",
          "location" : {
            "lat" : "42.84472",
            "lon" : "17.63389"
          }
        }
      },
      {
        "_index" : "geonames",
        "_type" : "doc",
        "_id" : "wrM0LmkBrQo0YN4q8pyp",
        "_score" : 1.0,
        "_source" : {
          "id" : "3288000",
          "name" : "Ponikve",
          "location" : {
            "lat" : "42.84472",
            "lon" : "17.61306"
          }
        }
      }
    ]
  },
  "aggregations" : {
    "centroid" : {
      "location" : {
        "lat" : 42.84592493902892,
        "lon" : 17.62620317749679
      },
      "count" : 6
    }
  }
}
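The comparison for these 6 points can be reproduced with a few lines of Python. This is only a sketch of the check described above: the coordinates and the reported centroid are copied from the response, and the variable names are made up for illustration.

```python
from statistics import fmean

# (lat, lon) of the six hits, copied from the response above.
points = [
    (42.84111, 17.63361),  # Metohija
    (42.85083, 17.62),     # Ledinići
    (42.84806, 17.62472),  # Boljenovići
    (42.84611, 17.63194),  # Gornje Selo
    (42.84472, 17.63389),  # Bojnoge
    (42.84472, 17.61306),  # Ponikve
]

# Centroid reported by the geo_centroid aggregation.
reported_lat, reported_lon = 42.84592493902892, 17.62620317749679

# Plain arithmetic mean of the source coordinates.
mean_lat = fmean(lat for lat, _ in points)
mean_lon = fmean(lon for _, lon in points)

diff_lat = mean_lat - reported_lat
diff_lon = mean_lon - reported_lon
print(f"mean lat={mean_lat!r} lon={mean_lon!r}")
print(f"diff lat={diff_lat:.3e} lon={diff_lon:.3e}")
```

For this small excerpt the differences are on the order of 1e-7 degrees, far smaller than the offset seen on the full dataset.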

Here is a plot visualizing the inaccuracy I got with my test queries:
[Image: plot_discuss_elastic]

I also got this kind of inaccuracy when testing the Geo Bounds Aggregation (the coordinates of top_left and bottom_right are inaccurate). Basically, these aggregations use an average (centroid) and min/max (bounds), so where is this coming from? Looking at the source code, I saw something about encoding and decoding the coordinates when calculating the aggregations. Maybe that is the cause?

The effect you are describing is due to the way Elasticsearch (more exactly, Lucene) indexes points. Instead of indexing the points as doubles, it encodes the latitude and longitude as integers. This strategy reduces the size of the index, but it adds a small inaccuracy of around 1e-7 degrees (around 1 cm on the surface of the earth).
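To make the encoding effect concrete, here is a rough sketch of integer quantization of coordinates. It assumes 32 bits per dimension and rounding down, which is my understanding of how Lucene handles lat/lon points, but the constants and function names below are illustrative assumptions, not Lucene's actual code.

```python
import math

BITS = 32  # assumed number of bits per dimension

# Size of one integer step in degrees for each dimension.
LAT_STEP = 180.0 / 2 ** BITS
LON_STEP = 360.0 / 2 ** BITS


def encode(value, step):
    """Map a coordinate in degrees to an integer by rounding down."""
    return math.floor(value / step)


def decode(encoded, step):
    """Map an encoded integer back to degrees."""
    return encoded * step


def round_trip(lat, lon):
    """Return the coordinates as they look after encode + decode."""
    return (decode(encode(lat, LAT_STEP), LAT_STEP),
            decode(encode(lon, LON_STEP), LON_STEP))


# One of the points from the question, before and after the round trip.
lat, lon = round_trip(42.84111, 17.63361)
print(42.84111 - lat, 17.63361 - lon)
```

Under these assumptions a single round trip loses less than one step per coordinate, which stays below 1e-7 degrees for both latitude and longitude.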

This is the inaccuracy I expect you are seeing in the geo bounds aggregation. For the centroid, the inaccuracy is magnified because, when calculating the moving average of the latitude and longitude, the logic seems to encode and decode the partial results.
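The magnification can be illustrated with a small simulation. This is not the Elasticsearch implementation, only a sketch under stated assumptions: a running mean that is rounded down to a 32-bit grid after every update, random test latitudes, and made-up helper names.

```python
import math
import random

# Assumed size of one encoded latitude step in degrees (32 bits over 180°).
STEP = 180.0 / 2 ** 32


def quantize(value, step=STEP):
    """Simulate an encode + decode round trip by rounding down to the grid."""
    return math.floor(value / step) * step


def running_mean(values, requantize):
    """Incremental mean; optionally re-quantize the partial result each step."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n
        if requantize:
            mean = quantize(mean)
    return mean


random.seed(0)
lats = [random.uniform(42.0, 43.0) for _ in range(10_000)]  # hypothetical data

exact = running_mean(lats, requantize=False)
drifted = running_mean(lats, requantize=True)
drift = exact - drifted
print(f"drift = {drift:.3e} degrees, i.e. {drift / STEP:.0f} steps")
```

In this sketch every update loses a fraction of a step in the same direction, so the errors add up instead of cancelling and the drift grows with the number of points. That would be consistent with a tiny error on 6 points and a much larger one on millions of points.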

FYI: https://github.com/elastic/elasticsearch/issues/41032

Nice, thank you !

edit:
I can't reproduce this on our production stack; maybe this is a side effect of our development Elasticsearch "virtual" cluster running 3 ES instances on the same VM.
The development data restored on the production cluster leads to the same result for each replayed query, so everything is fine and as expected.
/edit

This may be a related problem:
What I see is that the results seem to alternate between two values for exactly the same query.
If I replay the same query in Kibana, I consistently get two different results with differences of up to 1e-7. That means:

  1. Query A -> Result A
  2. Query A -> Result A'
  3. Query A -> Result A
  4. Query A -> Result A'
    and so on.

If I mix two queries, I get the same results:

  1. Query A -> Result A
  2. Query B -> Result B
  3. Query A -> Result A
  4. Query B -> Result B
  5. Query C -> Result C
  6. Query A -> Result A'
  7. Query B -> Result B'
  8. Query B -> Result B

In other words: if I replay an even number of queries, I get constant results; if I replay an odd number, I get alternating results.

Does the mentioned fix also fix this behaviour?

POST geo_data/_search?size=0
{
  "aggs": {
    "location": {
      "geo_centroid": {
        "field": "position"
      }
    }
  }
}

The response is either

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 19434896,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "location" : {
      "location" : {
        "lat" : 1.2363329981435465,
        "lon" : 103.64341515934639
      },
      "count" : 64802
    }
  }
}

or

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 19434896,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "location": {
            "location": {
                "lat": 1.2363330319779686,
                "lon": 103.64341576836598
            },
            "count": 64802
        }
    }
}

For me the output stays the same no matter how I execute the query.

Thanks for your tests, I can now confirm that the alternating output is caused by the virtual cluster setting we use in development. We changed it to a single node cluster and now the output stays the same :slight_smile:
