Shards/routing based on geohash


(Bela Boros) #1

Hi,
Please help me,
I would like to search based on geo location efficiently.

For that I would like to create shards/routing based on geohash.
(Suggestion: Sharding of geo_points)

The documentation does not contain example how to do sharding based on geohash.

Can you help me what is the problem with the example below?

Mapping:

{
	"settings" : {
		"number_of_shards" : 10,
		"number_of_replicas" : 0
	},
	
	
	"mappings" : {
        "transaction" : {
            "_routing" : {
                "required" : true,
                "path" : "location.geohash"
            },
		    "properties" : {
		        "_all" : {"enabled" : true},
		        "brand" : { "type":"String"},
		        "location" : { 
		            "type":"geo_point", 
		            "lat_lon":true,  
		            "geohash":true, 
		            "geohash_prefix":true, 
		            "geohash_precision": "50km" 
		        },
		        "value" : { "type":"float"},
		        "time" : { "type":"date" , "format":"YYYY-MM-dd HH:mm:ss"  }
		    }
	    }
	}
}

Cannot insert a document:

{
    "brand" : "Tesco",
    "location" : {
        "lat" : 30,
        "lon" : 30
    },
    "amount": 10
}

The error message says:

{
  "error": "RoutingMissingException[routing is required for [shopping]/[transaction]/[null]]",
  "status": 400
}

The documentation says, that ElasticSearch is able to "extract routing automatically" from the document.

Please help how to do sharding/routing based on geohash correctly.

Thanks
Béla


(Colin Goodheart-Smithe) #2

While its true that Elasticsearch can extract routing values from a field in the document, it is not recommended as it requires the coordinating node (the node which receives your request) to parsed the entire document's JSON to find the routing value before it can route it to the correct shard where the document will be re-parsed to be indexed. This makes extracting the routing value from the document very inefficient and this functionality (extracting routing from the document) will actually be removed in the upcoming 2.0 release. Also, the routing value that is extracted is only the value you pass in your request not the tokenised value that will be indexed so even with this approach elasticsearch would not be able to calculate the geohash form the geo_point field in your document and use that as the routing value.

That said you can implement this approach by calculating the geohash in your application and passing that to Elasticsearch as the routing value (Elasticsearch will take any string value as the routing value). Some thoughts on this approach:

  • Routing values need to be picked carefully to avoid unbalanced shard size. If you route on a fixed geohash length (e.g. geohashes of length 4) you will be splitting the world into equal tiles, but these tiles will probably not contain equal number of documents because the density of transactions (it seems like your data is shopping transactions) is not the same in a remote town in Siberia as in London UK. Because of this the logic in your application that works would need to pick routing values which are longer geohashes for more densely populated tiles (in terms of number of transactions) and shorter geohashes for less densely populated tiles.
  • When you perform searches your application would need to pick the routing values intelligently too. You cannot simply pick the geohash that contains the point you are looking for as you may miss documents which are in the search area but not on the same routing tile. Your application would need to work out all the routing geohashes for the query area and pass all of these as a comma separated routing value to ensure all appropriate shards are searched. In practice this may actually end up including all shards for a proportional of your searches.

Hope this helps


(ed) #3

Hi, I'm doing something similar, except I've decided to represent a square polygon on the map with an ID (e.g. 1, 2, 3, a, b, c etc). I've been noticing as I increase shard count, I see more collisions. Is there any advice on what kind of ID's I can use to ensure minimal or no collisions? For instance, given 48 shards, 4 shards have collisions and 4 shards are empty. It is my understanding ES uses DJB hash. However, I do not wish to generate ID's based on a specific hash function, as I know this function could change over time. Would UUID's work better in general than simply using sequential id's? Or is there something else I can do to minimize collisions? Thanks


(Colin Goodheart-Smithe) #4

What do you mean by collisions here?


(system) #5