I've been using Elasticsearch for quite a few years now and it powers one of my sites which deals with 3d points in space (effectively solar systems in the galaxy and the bodies within those solar systems).
It works mostly fine; I've upgraded Elasticsearch several times and the data size continues to grow (currently around 90 million bodies).
However, the most common use case is sorting by Euclidean distance from a specific location in 3d space. As long as the filter stages reduce the data set enough this is fine, but when a filter produces millions of results (or there is no filter at all) Elasticsearch has to calculate the distance for every record in order to return the top k nearest neighbours. The worst case (no filter, sort by distance) currently takes 20 seconds on my Elasticsearch installation, which is unacceptable for me.
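For illustration, the brute-force top-k scan that the sort effectively performs can be sketched in Python (made-up data; squared distance is used since the square root doesn't change the ordering):

```python
import heapq

def nearest_k(bodies, origin, k):
    """Brute-force top-k by squared Euclidean distance -- O(n), scans every record."""
    ox, oy, oz = origin
    def dist2(b):
        x, y, z = b
        return (ox - x) ** 2 + (oy - y) ** 2 + (oz - z) ** 2
    # heapq.nsmallest keeps only k candidates in memory while scanning
    return heapq.nsmallest(k, bodies, key=dist2)

bodies = [(0, 0, 0), (10, 0, 0), (1, 1, 1), (-2, 0, 0)]
print(nearest_k(bodies, (0, 0, 0), 2))  # [(0, 0, 0), (1, 1, 1)]
```

At 90 million records that linear scan is the 20-second worst case; nothing about the scan can be skipped without an index structure.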
I had been watching https://www.elastic.co/blog/lucene-points-6.0 with great anticipation, but this seems to have gone by the wayside. There are several plugins that allow nearest-neighbour filtering (normally for classifying images), but these end up using angular difference, which, whilst helpful, isn't the same as Euclidean distance, so they don't really fit my use case.
I had considered geohashing, or subdividing the galaxy into quadrants and precalculating distances from each of those up front. That could work, but it would create a serious number of edge cases to handle, or else the orderings would end up wrong.
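To make the edge cases concrete, here is a minimal sketch of the subdivision idea (cell size, names and data are all made up): points are bucketed into cubic cells, and the search has to expand outwards shell by shell, because the true nearest neighbour can sit in an adjacent cell. You can only stop once the best candidate found is provably closer than anything in the unsearched shells — that stopping condition is exactly where the orderings go wrong if you get it slightly off:

```python
import math
from collections import defaultdict

CELL = 100.0  # hypothetical cell edge length

def cell_of(p):
    return tuple(int(math.floor(c / CELL)) for c in p)

def build_grid(points):
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p)].append(p)
    return grid

def nearest(grid, origin):
    ox, oy, oz = origin
    cx, cy, cz = cell_of(origin)
    best, best_d2 = None, float("inf")
    r = 0
    while True:
        # scan the shell of cells at Chebyshev radius r around the origin's cell
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    if max(abs(dx), abs(dy), abs(dz)) != r:
                        continue  # interior cells were scanned on earlier passes
                    for (x, y, z) in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        d2 = (ox - x) ** 2 + (oy - y) ** 2 + (oz - z) ** 2
                        if d2 < best_d2:
                            best, best_d2 = (x, y, z), d2
        # any point in an unsearched shell is at least r * CELL away,
        # so only stop once the best found beats that bound
        if best is not None and best_d2 <= (r * CELL) ** 2:
            return best
        r += 1

grid = build_grid([(0, 0, 0), (150, 0, 0), (1000, 1000, 1000)])
print(nearest(grid, (10, 0, 0)))  # (0, 0, 0)
```

Even this toy version assumes a non-empty grid and a single nearest point; doing it for top-k with arbitrary filters multiplies the bookkeeping, which is why I'd rather lean on the index.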
I know that the underlying Lucene supports n-dimensional k-d trees for the index, so I've eventually decided that adding a new datatype/index plugin to expose those for sorting might solve my issue. However, I'm a little stuck on this part: there is documentation for plugin authors, but after reading through it I'm not even sure where to start. I did consider checking through the 2d geographic code already in Elasticsearch, as I'm guessing this already uses the k-d trees (just specialised for 2d), so it might only need a small amount of modification to work in n dimensions (in my case I only need 3, but I can foresee other people wanting more). After grepping and perusing the source, though, I struggled to find exactly where the Lucene index type for geo points is defined and where it integrates with the sort routines.
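For anyone unfamiliar with why the k-d tree would help here, the pruning it enables can be illustrated with a toy pure-Python version. This is only a sketch of the data structure, not Lucene's actual BKD implementation, and all names are made up:

```python
def build_kdtree(points, depth=0):
    """Toy 3d k-d tree: each node splits on one axis, cycling x -> y -> z."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Descend towards the target, then backtrack only into branches that could
    still hold a closer point -- this pruning is what avoids the full scan."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], target))
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if diff * diff < best[0]:  # splitting plane closer than current best: check far side too
        best = nearest(far, target, best)
    return best

tree = build_kdtree([(0, 0, 0), (5, 5, 5), (1, 2, 3), (9, 9, 9)])
print(nearest(tree, (1, 1, 1)))  # (3, (0, 0, 0)) -- squared distance, point
```

The hope is that exposing Lucene's equivalent structure for sorting would turn the O(n) worst case into something closer to O(log n) per neighbour.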
Also, if anyone has any other suggestions for improving this use case that don't involve throwing more hardware at the problem, I'd be interested. I've already done basic optimisations such as removing the sqrt (which is unneeded for sorting) from the distance calculation. The query I'm using is below; the relevant fields in the sort are all doubles (system_x, system_y and system_z). The index itself is technically not updated live, as I rebuild it from scratch every day and switch, so I can precalculate things if required.
{
  "_source" : true,
  "script_fields" : {
    "distance" : {
      "script" : {
        "lang" : "painless",
        "source" : "double x = (-4960.75 - doc['system_x'].value); double y = (298.53125 - doc['system_y'].value); double z = (-1441.4375 - doc['system_z'].value); return StrictMath.sqrt(x*x + y*y + z*z);"
      }
    }
  },
  "query" : {
    "bool" : {
      "must" : []
    }
  },
  "sort" : [
    {
      "_script" : {
        "script" : {
          "lang" : "painless",
          "source" : "double x = (-4960.75 - doc['system_x'].value); double y = (298.53125 - doc['system_y'].value); double z = (-1441.4375 - doc['system_z'].value); return x*x + y*y + z*z;"
        },
        "type" : "number",
        "order" : "asc"
      }
    }
  ]
}
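(For completeness, dropping the sqrt from the sort script is safe because sqrt is monotonic, so sorting by squared distance gives exactly the same order. A quick check with made-up points:)

```python
import math
import random

random.seed(42)
origin = (-4960.75, 298.53125, -1441.4375)
points = [tuple(random.uniform(-5000, 5000) for _ in range(3)) for _ in range(1000)]

def d2(p):
    """Squared Euclidean distance from origin -- what the sort script returns."""
    return sum((o - c) ** 2 for o, c in zip(origin, p))

by_squared = sorted(points, key=d2)
by_true = sorted(points, key=lambda p: math.sqrt(d2(p)))
assert by_squared == by_true  # sqrt is monotonic, so the ordering is identical
```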