After Elasticsearch 7.10 -> 7.16 Upgrade, Geo Shape Queries Cause a Heap Problem (G1 Humongous Allocations)

Hi;
We have upgraded our ES cluster from 7.10.0 to 7.16.2. After the upgrade, we noticed that heap usage increased a lot because of G1 humongous allocations; the ES heap contains a lot of humongous regions.

When heap usage reaches 95%, the parent circuit breaker kicks in to prevent ES from running out of memory, as below:

[2021-12-25T13:07:12,966][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [esdata01] attempting to trigger G1GC due to high heap usage [30646782976]
[2021-12-25T13:07:13,140][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [esdata01] GC did bring memory usage down, before [30646782976], after [6917893816], allocations [69], duration [174]

After investigating the issue, we saw that the problem is directly related to our geo_shape index and queries like the one below.

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_shape": {
          "geometry": {
            "shape": {
              "type": "Circle",
              "radius": "100.0m",
              "coordinates": [
                ${__javaScript((Math.random() * (360) - 180).toFixed(7) * 1,)},
                ${__javaScript((Math.random() * (180) - 90).toFixed(7) * 1,)}
              ]
            },
            "relation": "intersects"
          }
        }
      }
    }
  }
}
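
For context, the field we filter on ("geometry") is mapped as geo_shape. A stripped-down sketch of the mapping (the index name below is just a placeholder and our real mapping has more fields):

PUT /my-geo-index
{
  "mappings": {
    "properties": {
      "geometry": {
        "type": "geo_shape"
      }
    }
  }
}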

It seems Elasticsearch changed the geo_shape implementation between 7.10 and 7.16, and that change now causes a critical heap issue.

Our JVM is configured with a 30GB heap; the other settings are the ES defaults.

Do you have any recommendation to fix this issue? Or is Elasticsearch aware of this issue, and will it be fixed in upcoming releases?

Thanks;
Bülent

Hi @bulent,

That is an interesting error. In order to reproduce it, could you provide the following information?

  • Java version
  • Number of documents / shards on the indices you are querying
  • How is this query run, e.g. in a loop a fixed number of times?

That query seems to be part of some test suite, is that correct?

Thanks

Hi Ignacio;

We are using the Java bundled with Elasticsearch, so the Java version is 17.0.1 for Elasticsearch 7.16.2.

The index size is 8.5GB (1 primary + 5 replicas), and the document count is 354,589.

We have an application where users upload their photos from web or mobile clients, and our application queries Elasticsearch to find the location of the uploaded photos.

So the query is part of our production usage.

But we can also reproduce the same issue with a load test by sending the following from JMeter in our PreProd environment.

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_shape": {
          "geometry": {
            "shape": {
              "type": "Circle",
              "radius": "100.0m",
              "coordinates": [
                ${__javaScript((Math.random() * (360) - 180).toFixed(7) * 1,)},
                ${__javaScript((Math.random() * (180) - 90).toFixed(7) * 1,)}
              ]
            },
            "relation": "intersects"
          }
        }
      }
    }
  }
}

While running the JMeter tests (1000 loops x 200 threads, for instance), we monitor the ES logs and heap usage (roughly as sketched below); the problem occurs immediately after the load test starts.
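
A minimal sketch of the calls we use for that monitoring (the columns and filter_path are just what we find convenient, nothing special):

GET _cat/nodes?v=true&h=name,heap.percent,heap.current,heap.max
GET _nodes/stats/breaker?filter_path=nodes.*.breakers.parent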

There was no such problem with Elasticsearch 7.10.0; the problem appeared only after upgrading ES to 7.16.2.

Thanks;
Bülent

Thanks! I have not been able to reproduce it, but the info is still pretty useful.

I am aware of a change in Lucene where, in some cases, we now build a BitSet when executing queries like the one above in order to speed them up. You might be hitting that change, but it should not create humongous objects, as the size of the BitSet is proportional to the number of documents. In your case the BitSet should use at most ~44KB, which should not cause issues.
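
For reference, the back-of-the-envelope math behind that estimate: a BitSet needs one bit per document, so 354,589 documents take roughly 354,589 / 8 ≈ 44,324 bytes, i.e. about 44KB, which is far too small for G1 to treat as a humongous allocation.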

The other reason I can think of is that you have documents with humongous polygons, and when retrieving them we are now allocating them in large byte arrays. This is only speculation; could you run the test above using, for example, the _count API instead of _search to see if the error disappears?
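
Something along these lines should do it, with a placeholder index name and fixed example coordinates in place of the JMeter functions:

GET /my-geo-index/_count
{
  "query": {
    "bool": {
      "filter": {
        "geo_shape": {
          "geometry": {
            "shape": {
              "type": "Circle",
              "radius": "100.0m",
              "coordinates": [ 28.9784, 41.0151 ]
            },
            "relation": "intersects"
          }
        }
      }
    }
  }
}

_count runs the same query but never fetches the _source of matching documents, so if the humongous allocations disappear with it, that would point at document retrieval rather than the query itself.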

Our documents use the MultiPolygon type, like the example below; each one contains a lot of coordinate data.

{"id":5541483,"osm_type":"relation","type":"Feature","name":"address data","properties":{"name":"address data","boundary":"administrative","addr:city":"address data","admin_level":"10"},"geometry":{"type":"MultiPolygon","coordinates":[[[[29.099582199999997, ...}}

But this was not causing any issue until we upgraded ES from 7.10.0 to 7.16.2.

So probably some code change happened around MultiPolygon handling, and querying an index containing a lot of documents with MultiPolygon shapes now causes humongous objects (or MultiPolygon handling is affected indirectly by some other code change).

I will check with our developers how to re-run our load test with the _count API instead of _search, and let you know the results.

Maybe a more efficient way to look into this issue is by creating a heap dump when the issue is happening? Would that be possible?

Sure, I can share a heap dump.

I have reproduced the issue again in our PreProd environment, as shown below.

[2022-01-04T09:19:50.424+0000][10980][gc,heap ] GC(1000) Humongous regions: 380->16
[2022-01-04T09:19:50.685+0000][10980][gc,heap ] GC(1001) Humongous regions: 379->3
[2022-01-04T09:19:50.773+0000][10980][gc,heap ] GC(1002) Humongous regions: 194->13
[2022-01-04T09:19:50.885+0000][10980][gc,heap ] GC(1003) Humongous regions: 265->18
[2022-01-04T09:19:50.912+0000][10980][gc,heap ] GC(1005) Humongous regions: 23->13
[2022-01-04T09:19:51.105+0000][10980][gc,heap ] GC(1007) Humongous regions: 383->8
[2022-01-04T09:19:51.272+0000][10980][gc,heap ] GC(1008) Humongous regions: 382->14
[2022-01-04T09:19:51.410+0000][10980][gc,heap ] GC(1009) Humongous regions: 386->13
[2022-01-04T09:19:51.584+0000][10980][gc,heap ] GC(1010) Humongous regions: 388->24
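
For reference, those [gc,heap] lines come from the JVM's unified GC logging, which Elasticsearch enables out of the box; an entry roughly like the following in jvm.options produces them (the path and rotation settings here are placeholders, not our exact values):

-Xlog:gc*:file=/path/to/gc.log:utctime,pid,tags:filecount=32,filesize=64m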

[2022-01-04T12:19:27,910][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] attempting to trigger G1GC due to high heap usage [2077476072]
[2022-01-04T12:19:27,936][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] GC did bring memory usage down, before [2077476072], after [705796296], allocations [3], duration [26]
[2022-01-04T12:19:30,659][INFO ][o.e.m.j.JvmGcMonitorService] [data01] [gc][552] overhead, spent [278ms] collecting in the last [1s]
[2022-01-04T12:19:34,760][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] attempting to trigger G1GC due to high heap usage [2082090608]
[2022-01-04T12:19:34,772][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] GC did bring memory usage down, before [2082090608], after [644053840], allocations [1], duration [13]
[2022-01-04T12:19:42,020][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] attempting to trigger G1GC due to high heap usage [2042498592]
[2022-01-04T12:19:42,038][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] GC did bring memory usage down, before [2042498592], after [473900056], allocations [1], duration [17]
[2022-01-04T12:19:47,125][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] attempting to trigger G1GC due to high heap usage [2079387200]
[2022-01-04T12:19:47,137][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [data01] GC did bring memory usage down, before [2079387200], after [384999920], allocations [3], duration [13]

And I took a heap dump:

./jmap -dump:format=b,file=es_dump.hprof 10980
Dumping heap to Elasticsearch-7.16.2-data/jdk/bin/es_dump.hprof ...
Heap dump file created [1723810040 bytes in 3.114 secs]

I will share the heap dump link with you in a private message (the file size is ~1.7GB).
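
If the full dump turns out to be too large to share, I can also retake it with jmap's live option, e.g. ./jmap -dump:live,format=b,file=es_dump.hprof 10980, which triggers a full GC first and dumps only reachable objects, so the file is usually smaller.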
