Elasticsearch multidimensional points and nearest neighbour sorting

I've been using Elasticsearch for quite a few years now and it powers one of my sites which deals with 3d points in space (effectively solar systems in the galaxy and the bodies within those solar systems).

It works mostly fine and I've upgraded Elasticsearch several times and the data size continues to grow (currently sitting at around 90million bodies).

However the most common use case is to sort by euclidean distance from a specific location in 3d space, so long as the filter stages reduce the data size by enough this is fine, however when a filter produces millions of results (or there is no filter) elasticsearch has to calculate the pythagorean distance for every record in order to return the top x nearest neighbours. The worst case currently takes 20 seconds (no filter, sort by distance) on my elasticsearch installation which is unacceptable for me.

I had been watching https://www.elastic.co/blog/lucene-points-6.0 with great anticipation but this seems to have gone by the wayside. There are several plugins which allow for nearest neighbour filtering (normally for classifying images). These end up using angular difference which whilst helpful isn't the same as euclidean distance so they don't really fit my use case.

I had considered using geohashing or subdividing the galaxy into quadrants and calculating distances from all of those up front, which would potentially work but would create a serious number of edge cases to handle or else the orders would end up wrong.

I know that the underlying lucene supports n dimensional kd-trees for the index and so I've eventually decided that adding a new datatype/index plugin to expose those for sorting might solve my issue, however I'm a little stuck on this part. There is documentation for plugin authors but after reading through it I'm not even sure where to start adding it. I did consider checking through the 2d geographic code which is already in elasticsearch as I'm guessing this already uses the kd-trees just specialised for 2d and as such may only require a small amount of modification to get them to work in n dimensions (in my case I only need 3 but I can forsee other people desiring more) but after grepping and perusing the source I struggled to find exactly where the lucene index type for geo points was defined and where it integrates with the sort routines.

Also if anyone has any other suggestions for improving this use case that don't include throwing more hardware at the problem I'd be interested, I've already done basic optimisations such as removing the sqrt (which is unneeded for sorting) from the pythagorean equation. The query I'm using is below and the relevant fields in that sort are all doubles (system_x, system y and system_z). The index itself is technically not updated live as I rebuild from scratch every day and switch so I can precalculate things is required.

{
   "_source" : true,
   "script_fields" : {
      "distance" : {
         "script" : {
            "lang" : "painless",
            "source" : "double distance = StrictMath.sqrt((-4960.75 - doc[\u0027system_x\u0027].value)*(-4960.75 - doc[\u0027system_x\u0027].value) + (298.53125 - doc[\u0027system_y\u0027].value)*(298.53125 - doc[\u0027system_y\u0027].value) + (-1441.4375 - doc[\u0027system_z\u0027].value)*(-1441.4375 - doc[\u0027system_z\u0027].value));return distance;"
         }
      }
   },
   "query" : {
      "bool" : {
         "must" : []
      }
   },
   "sort" : [
      {
         "_script" : {
            "script" : {
               "source" : "double x = (-4960.75 - doc[\u0027system_x\u0027].value);double y = (298.5325 - doc[\u0027system_y\u0027].value);double z = (-1441.4375 - doc[\u0027system_z\u0027].value);double distance = (x*x + y*y + z*z);return distance;",
               "lang" : "painless"
            },
            "type" : "number",
            "order" : "asc"
         }
      }
   ]
}

Hi,

You can index dimensional points using Lucene, for example if you are using doubles you just need to index your 3D points using DoublePoint:

The only query available for that type of index are dimensional ranges, in your case would be a cube in your three dimensional space. With Lucene 8.0 (which it will be released in early 2019) you can implement a feature query for nearest neighbour search. There is an implementation for LatLonPoint:

Well, that's not exactly what I wanted but still promising, just need to wait for Elastic to move to Lucene 8 then and I can look at implementing my plugin (or bite the bullet and do the whole thing myself but that would appear to be overkill).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.