GeoShape: Consuming more heap on data node

Version 7.1.1

My mapping for the field is the following:

"geometry": {
  "type": "geo_shape",
  "tree": "quadtree",
  "precision": "8m"
}

When I check /_segments?verbose=true, I can see that most of memory_in_bytes (the memory held on heap) is taken by the geometry field. All the other text fields use very little heap.

I understand that geo_shape comes at the cost of more memory and disk space.

Query:
Is there a way to store this field in a compressed format and still be able to query it?

Hi,

Have you tried the new indexing strategy introduced in ES 6.6.0? It still has some limitations, but the cost in memory and disk space should be much lower.

Hi @Ignacio_Vera,
Thanks for your reply.
I have seen this strategy but have not tried it yet.

The reason is that we have been using the current strategy since 6.4.1 and it has given us accurate results.

Can you give some insight into how the 6.6.0 strategy differs from the one we have been using, in terms of accuracy?

The recursive strategy describes the shape using the grid provided (in your case a quadtree). That means the logic computes all the cells that intersect the indexed shape at the given precision and stores that information in the inverted index.

Every cell is described as a prefix path, and that path goes into the terms dictionary. The higher the precision, the more cells you need to describe your shape and the longer those paths will be. This dictionary is loaded into heap, which is why you see high heap usage for that field. Unfortunately, the only way to decrease heap usage is to index your shapes at a lower precision, using either the precision or the distance_error_pct parameter.
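To make that growth concrete, here is a rough Python sketch of quadtree prefix paths. It is not Lucene's implementation: the helper names (levels_for_precision, cell_path, cells_covering_bbox), the 0-3 quadrant labels and the equator-based level estimate are all assumptions made for illustration, and a real prefix tree can stop early at coarser cells that lie fully inside the shape.

```python
import math

# Approximate Earth circumference at the equator, in meters (assumed constant).
EQUATOR_M = 40_075_017.0


def levels_for_precision(meters: float) -> int:
    """Smallest level count at which a cell is at most `meters` wide at the equator.

    Simplified estimate; Elasticsearch's own calculation may differ.
    """
    return max(1, math.ceil(math.log2(EQUATOR_M / meters)))


def cell_path(lon: float, lat: float, levels: int) -> str:
    """Prefix path of the cell containing (lon, lat): one quadrant digit per level."""
    min_lon, max_lon, min_lat, max_lat = -180.0, 180.0, -90.0, 90.0
    path = []
    for _ in range(levels):
        mid_lon = (min_lon + max_lon) / 2
        mid_lat = (min_lat + max_lat) / 2
        quadrant = 0
        if lon >= mid_lon:
            quadrant += 1
            min_lon = mid_lon
        else:
            max_lon = mid_lon
        if lat >= mid_lat:
            quadrant += 2
            min_lat = mid_lat
        else:
            max_lat = mid_lat
        path.append(str(quadrant))
    return "".join(path)


def cells_covering_bbox(min_lon, min_lat, max_lon, max_lat, levels) -> int:
    """Number of leaf cells at `levels` that a bounding box touches (upper bound on terms)."""
    n = 2 ** levels
    cell_w, cell_h = 360.0 / n, 180.0 / n
    x0 = min(n - 1, int((min_lon + 180.0) // cell_w))
    x1 = min(n - 1, int((max_lon + 180.0) // cell_w))
    y0 = min(n - 1, int((min_lat + 90.0) // cell_h))
    y1 = min(n - 1, int((max_lat + 90.0) // cell_h))
    return (x1 - x0 + 1) * (y1 - y0 + 1)


if __name__ == "__main__":
    # A box of 0.01 x 0.01 degrees, described at increasing depth.
    for levels in (10, 15, 20, levels_for_precision(8)):
        count = cells_covering_bbox(10.0, 50.0, 10.01, 50.01, levels)
        print(levels, cell_path(10.0, 50.0, levels), count)
```

Under this simplified estimate, a precision of 8m works out to 23 levels, so every term is a path of up to 23 characters, and the number of cells touched by even a small box grows quickly with each extra level.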

The new indexing strategy is based on Lucene's BKD tree. Shapes are vectorised into triangles, and each triangle is stored in the tree as a bounding box plus some extra information that helps reconstruct the original triangle. The precision of the shapes is limited only by the encoding used to store those vectors (about 1e-7 decimal degrees).
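As a rough illustration of where a limit of that order comes from, the Python sketch below quantizes each coordinate to a 32-bit integer grid and computes the bounding box of one triangle. The 32-bit width, the function names and the sample triangle are assumptions for the example, not a copy of Lucene's code.

```python
import math

BITS = 32  # assumed integer width per coordinate
LAT_STEP = 180.0 / 2 ** BITS  # degrees of latitude per integer step
LON_STEP = 360.0 / 2 ** BITS  # degrees of longitude per integer step


def encode_lat(lat: float) -> int:
    """Quantize a latitude by rounding down to the integer grid."""
    return math.floor(lat / LAT_STEP)


def decode_lat(q: int) -> float:
    return q * LAT_STEP


def encode_lon(lon: float) -> int:
    """Quantize a longitude by rounding down to the integer grid."""
    return math.floor(lon / LON_STEP)


def decode_lon(q: int) -> float:
    return q * LON_STEP


def triangle_bbox(triangle):
    """Bounding box (min_lon, min_lat, max_lon, max_lat) of three (lon, lat) vertices."""
    lons = [lon for lon, _ in triangle]
    lats = [lat for _, lat in triangle]
    return (min(lons), min(lats), max(lons), max(lats))


if __name__ == "__main__":
    lat, lon = 48.8566, 2.3522
    # The round-trip error is bounded by one grid step per axis.
    print(abs(decode_lat(encode_lat(lat)) - lat), LAT_STEP)
    print(abs(decode_lon(encode_lon(lon)) - lon), LON_STEP)
    print(triangle_bbox([(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]))
```

With these assumed constants the grid step is below 1e-7 degrees on both axes, so the loss is fixed by the encoding and does not depend on any precision setting in the mapping.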

The result is much faster indexing throughput, a smaller index, a smaller heap footprint and, in most cases, faster query throughput. And there is no need to set any extra parameters in order to get your data loaded into ES :).
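In practice that means dropping the prefix-tree parameters from the mapping. A minimal sketch in Python, assuming the field is still called geometry as in the question; the uses_prefix_tree helper is hypothetical and only checks for the prefix-tree parameters I am aware of:

```python
import json

# Mapping fragment from the question: these parameters select the prefix-tree strategy.
PREFIX_TREE_MAPPING = {
    "properties": {
        "geometry": {"type": "geo_shape", "tree": "quadtree", "precision": "8m"}
    }
}

# Same field with no prefix-tree parameters, which lets the BKD-based strategy apply.
BKD_MAPPING = {"properties": {"geometry": {"type": "geo_shape"}}}

# Parameters that belong to the prefix-tree implementation (list assumed complete).
PREFIX_TREE_PARAMS = (
    "tree",
    "precision",
    "tree_levels",
    "strategy",
    "distance_error_pct",
    "points_only",
)


def uses_prefix_tree(mapping: dict, field: str = "geometry") -> bool:
    """Hypothetical helper: True if the field sets any prefix-tree parameter."""
    params = mapping["properties"][field]
    return any(name in params for name in PREFIX_TREE_PARAMS)


if __name__ == "__main__":
    print(uses_prefix_tree(PREFIX_TREE_MAPPING), uses_prefix_tree(BKD_MAPPING))
    # JSON body that could be sent when creating a new index.
    print(json.dumps({"mappings": BKD_MAPPING}, indent=2))
```

Since the mapping of an existing field cannot be changed in place, this would most likely mean creating a new index with the simpler mapping and reindexing into it.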

You are awesome. Thanks :ok_hand:
