I've been using Elasticsearch for quite a few years now and it powers one of my sites which deals with 3d points in space (effectively solar systems in the galaxy and the bodies within those solar systems).
It works mostly fine; I've upgraded Elasticsearch several times and the data size continues to grow (currently around 90 million bodies).
However, the most common use case is sorting by Euclidean distance from a specific location in 3d space. As long as the filter stages reduce the data set enough this is fine, but when a filter produces millions of results (or there is no filter at all) Elasticsearch has to calculate the distance for every record in order to return the top k nearest neighbours. The worst case (no filter, sort by distance) currently takes 20 seconds on my Elasticsearch installation, which is unacceptable for me.
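For illustration, the brute-force top-k scan that the sort effectively performs can be sketched in Python (made-up data; squared distance is used since the square root doesn't change the ordering):

```python
import heapq

def nearest_k(bodies, origin, k):
    """Brute-force top-k by squared Euclidean distance -- O(n), scans every record."""
    ox, oy, oz = origin
    def dist2(b):
        x, y, z = b
        return (ox - x) ** 2 + (oy - y) ** 2 + (oz - z) ** 2
    # heapq.nsmallest keeps only k candidates in memory while scanning
    return heapq.nsmallest(k, bodies, key=dist2)

bodies = [(0, 0, 0), (10, 0, 0), (1, 1, 1), (-2, 0, 0)]
print(nearest_k(bodies, (0, 0, 0), 2))  # [(0, 0, 0), (1, 1, 1)]
```

At 90 million records that linear scan is the 20-second worst case; nothing about the scan can be skipped without an index structure.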
I had been watching https://www.elastic.co/blog/lucene-points-6.0 with great anticipation, but this seems to have gone by the wayside. There are several plugins that allow nearest-neighbour filtering (normally for classifying images), but these end up using angular difference, which, whilst helpful, isn't the same as Euclidean distance, so they don't really fit my use case.
I had considered geohashing, or subdividing the galaxy into quadrants and precalculating distances from each of those up front. That could work, but it would create a serious number of edge cases to handle, or else the orderings would end up wrong.
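To make the edge cases concrete, here is a minimal sketch of the subdivision idea (cell size, names and data are all made up): points are bucketed into cubic cells, and the search has to expand outwards shell by shell, because the true nearest neighbour can sit in an adjacent cell. You can only stop once the best candidate found is provably closer than anything in the unsearched shells — that stopping condition is exactly where the orderings go wrong if you get it slightly off:

```python
import math
from collections import defaultdict

CELL = 100.0  # hypothetical cell edge length

def cell_of(p):
    return tuple(int(math.floor(c / CELL)) for c in p)

def build_grid(points):
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p)].append(p)
    return grid

def nearest(grid, origin):
    ox, oy, oz = origin
    cx, cy, cz = cell_of(origin)
    best, best_d2 = None, float("inf")
    r = 0
    while True:
        # scan the shell of cells at Chebyshev radius r around the origin's cell
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    if max(abs(dx), abs(dy), abs(dz)) != r:
                        continue  # interior cells were scanned on earlier passes
                    for (x, y, z) in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        d2 = (ox - x) ** 2 + (oy - y) ** 2 + (oz - z) ** 2
                        if d2 < best_d2:
                            best, best_d2 = (x, y, z), d2
        # any point in an unsearched shell is at least r * CELL away,
        # so only stop once the best found beats that bound
        if best is not None and best_d2 <= (r * CELL) ** 2:
            return best
        r += 1

grid = build_grid([(0, 0, 0), (150, 0, 0), (1000, 1000, 1000)])
print(nearest(grid, (10, 0, 0)))  # (0, 0, 0)
```

Even this toy version assumes a non-empty grid and a single nearest point; doing it for top-k with arbitrary filters multiplies the bookkeeping, which is why I'd rather lean on the index.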
I know that the underlying Lucene supports n-dimensional k-d trees for the index, so I've eventually decided that adding a new datatype/index plugin to expose those for sorting might solve my issue. However, I'm a little stuck on this part: there is documentation for plugin authors, but after reading through it I'm not even sure where to start. I did consider checking through the 2d geographic code already in Elasticsearch, as I'm guessing this already uses the k-d trees (just specialised for 2d), so it might only need a small amount of modification to work in n dimensions (in my case I only need 3, but I can foresee other people wanting more). After grepping and perusing the source, though, I struggled to find exactly where the Lucene index type for geo points is defined and where it integrates with the sort routines.
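For anyone unfamiliar with why the k-d tree would help here, the pruning it enables can be illustrated with a toy pure-Python version. This is only a sketch of the data structure, not Lucene's actual BKD implementation, and all names are made up:

```python
def build_kdtree(points, depth=0):
    """Toy 3d k-d tree: each node splits on one axis, cycling x -> y -> z."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Descend towards the target, then backtrack only into branches that could
    still hold a closer point -- this pruning is what avoids the full scan."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], target))
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if diff * diff < best[0]:  # splitting plane closer than current best: check far side too
        best = nearest(far, target, best)
    return best

tree = build_kdtree([(0, 0, 0), (5, 5, 5), (1, 2, 3), (9, 9, 9)])
print(nearest(tree, (1, 1, 1)))  # (3, (0, 0, 0)) -- squared distance, point
```

The hope is that exposing Lucene's equivalent structure for sorting would turn the O(n) worst case into something closer to O(log n) per neighbour.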
Also, if anyone has any other suggestions for improving this use case that don't involve throwing more hardware at the problem, I'd be interested. I've already done basic optimisations such as removing the sqrt (which is unneeded for sorting) from the distance calculation. The query I'm using is below; the relevant fields in the sort are all doubles (system_x, system_y and system_z). The index itself is technically not updated live, as I rebuild it from scratch every day and switch, so I can precalculate things if required.
{
  "_source" : true,
  "script_fields" : {
    "distance" : {
      "script" : {
        "lang" : "painless",
        "source" : "double x = (-4960.75 - doc['system_x'].value); double y = (298.53125 - doc['system_y'].value); double z = (-1441.4375 - doc['system_z'].value); return StrictMath.sqrt(x*x + y*y + z*z);"
      }
    }
  },
  "query" : {
    "bool" : {
      "must" : []
    }
  },
  "sort" : [
    {
      "_script" : {
        "script" : {
          "lang" : "painless",
          "source" : "double x = (-4960.75 - doc['system_x'].value); double y = (298.53125 - doc['system_y'].value); double z = (-1441.4375 - doc['system_z'].value); return x*x + y*y + z*z;"
        },
        "type" : "number",
        "order" : "asc"
      }
    }
  ]
}
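(For completeness, dropping the sqrt from the sort script is safe because sqrt is monotonic, so sorting by squared distance gives exactly the same order. A quick check with made-up points:)

```python
import math
import random

random.seed(42)
origin = (-4960.75, 298.53125, -1441.4375)
points = [tuple(random.uniform(-5000, 5000) for _ in range(3)) for _ in range(1000)]

def d2(p):
    """Squared Euclidean distance from origin -- what the sort script returns."""
    return sum((o - c) ** 2 for o, c in zip(origin, p))

by_squared = sorted(points, key=d2)
by_true = sorted(points, key=lambda p: math.sqrt(d2(p)))
assert by_squared == by_true  # sqrt is monotonic, so the ordering is identical
```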