Java heap space when using unique count on specific field

Hello,
First of all, I am very new to Elasticsearch, which could be the main problem :slight_smile:.
My settings:
I am running single node Elasticsearch (5.2.2) and Kibana (5.2.2) in Docker containers. I also tried the same with 5.1.x (with the same result).
I have one index (let's say "my_index") with around 20 million documents (~7GB).
My problem:
I was running this on version 2 without much trouble, then upgraded, and since then I suffer a lot from Java heap space out-of-memory (OOM) errors.
My logs (docker logs -f elasticsearch) are full of garbage collector warnings, e.g. [gc][108] overhead, spent [1s] collecting in the last [1.2s].
At first I thought the heap was simply too small (although the machine where ES v2 was running had only 8GB), so I started experimenting with it.
Right now it is set to 24GB (-Xms24G -Xmx24G). No matter what I set, it always ends in OOM after one specific attempt to create a visualization in Kibana.
From what I've heard, the default settings of both Kibana and ES are quite all right, so I didn't change anything else (although I did experiment with them, with no luck).
An example document looks like this:

{
  "_index" : "my_index",
  "_type" : "event",
  "_id" : "someID",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "time" : {
      "observation" : "2016-03-12T13:44:38+00:00",
      "source" : "2016-03-12T13:58:30+00:00"
    },
    "feed" : {
      "url" : "http://xxxx",
      "accuracy" : 20.0,
      "name" : "some_name"
    },
    "raw" : "some_hase",
    "classification" : {
      "type" : "",
      "taxonomy" : ""
    },
    "source" : {
      "geolocation" : {
        "city" : "city",
        "cc" : "country-code",
        "latitude" : 154,
        "longitude" : 254
      },
      "asn" : 1234,
      "ip" : "255.255.255.255",
      "network" : "255.255.255.255/12"
    }
  }
}

My problem is that I can't use Unique Count with source.ip, for example in Kibana's Visualize -> Histogram, where I want the Y-axis to be a Unique Count of the field source.ip and the X-axis to be source.ip with a sub-bucket of feed name. The result should show which source.ip came to me from the most sources.
In general, unique count is problematic in my case.

Whenever I try it, Kibana shows:
"Error: Request Timeout after 30000ms
at http://kibana_URL/bundles/kibana.bundle.js?v=14723:14:8629
at http://kibana_URL/bundles/kibana.bundle.js?v=14723:14:9050"

This shows in logs (docker logs -f elasticsearch):
http://pastebin.com/Eejst5g3
And also this:
http://pastebin.com/QVLB9ERA

These are the settings of my index:
"settings" : {
  "index" : {
    "refresh_interval" : "5s",
    "number_of_shards" : "1",
    "provided_name" : "my_index",
    "number_of_replicas" : "0",
    "version" : {
      "created" : "5010299",
      "upgraded" : "5020299"
    }
  }
}

Health of the index:
"my_index" : {
  "status" : "green",
  "number_of_shards" : 1,
  "number_of_replicas" : 0,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
},

What should I try to fix this? Could anybody help or advise? I will provide additional information if needed.
Thx

The problem is caused by having many unique buckets (source IPs) and trying to count exactly how many things fall into each of these buckets. We can reduce the memory overhead by relaxing the need to count exactly.

The unique-count part of your query is executed under the covers by the cardinality aggregation [1]. By default it counts things accurately up to 3,000 unique values and then flips into a fuzzier (potentially inaccurate) mode that counts more values using less memory. Holding up to 3,000 unique values for each of the source.ip buckets can take a lot of memory, so we can reduce the memory cost, at the price of some accuracy, by lowering the precision_threshold setting to something significantly less than 3,000.
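For reference, the request Kibana builds for a visualization like yours looks roughly like the sketch below (field names taken from your example document; the terms size and precision_threshold value of 100 are just illustrative starting points, not recommendations):

```json
{
  "size": 0,
  "aggs": {
    "ips": {
      "terms": { "field": "source.ip", "size": 20 },
      "aggs": {
        "unique_feeds": {
          "cardinality": {
            "field": "feed.name",
            "precision_threshold": 100
          }
        }
      }
    }
  }
}
```

The memory pressure comes from multiplying the number of terms buckets by the per-bucket cost of each cardinality counter, so lowering either the terms size or the precision_threshold reduces the heap needed.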

I had a similar query (not source-ips -> num feeds but reviewers -> num sellers) which also ran out of memory using a visualization like yours. By changing the "advanced" setting on the Y-axis unique-count metric, as in this screenshot, I prevented the memory issue:
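In Kibana you can do this without writing the whole query yourself: expand the Y-axis metric in the visualization editor, open Advanced, and put the override into the JSON Input box. Kibana merges it into the generated cardinality aggregation. A minimal example (the value 100 is just a starting point to tune for your data):

```json
{ "precision_threshold": 100 }
```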

Tuning the numbers for your dataset will be walking a tricky line between getting accurate results and not running out of memory. Recognise that this is a challenging request to run in a single pass using aggregations on an event-centric index and sometimes entity-centric indexes make more sense when studying the behaviour of entities.

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

