Java heap space when using unique count on specific field

Hello,
First of all, I am very new to Elasticsearch, which could be the main problem :slight_smile:.
My settings:
I am running single node Elasticsearch (5.2.2) and Kibana (5.2.2) in Docker containers. I also tried the same with 5.1.x (with the same result).
I have one index (let's say "my_index") with around 20 million documents (~7GB).
My problem:
I was running this on version 2 without much trouble, then upgraded, and since then I suffer a lot from Java heap space out-of-memory (OOM) errors.
My logs (docker logs -f elasticsearch) are full of garbage collector warnings, e.g. [gc][108] overhead, spent [1s] collecting in the last [1.2s].
At first I thought the heap was simply too small (although the machine where ES v2 was running had only 8GB), so I started experimenting with it.
Right now it is set to 24GB (-Xms24G -Xmx24G). No matter what I set, it always ends in OOM after one specific attempt to create a visualization in Kibana.
From what I've heard, the default settings of both Kibana and ES are quite all right, so I didn't change anything else (although I did experiment with them, with no luck).
An example document looks like this:

{
  "_index" : "my_index",
  "_type" : "event",
  "_id" : "someID",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "time" : {
      "observation" : "2016-03-12T13:44:38+00:00",
      "source" : "2016-03-12T13:58:30+00:00"
    },
    "feed" : {
      "url" : "http://xxxx",
      "accuracy" : 20.0,
      "name" : "some_name"
    },
    "raw" : "some_hase",
    "classification" : {
      "type" : "",
      "taxonomy" : ""
    },
    "source" : {
      "geolocation" : {
        "city" : "city",
        "cc" : "country-code",
        "latitude" : 154,
        "longitude" : 254
      },
      "asn" : 1234,
      "ip" : "255.255.255.255",
      "network" : "255.255.255.255/12"
    }
  }
}

My problem is that I can't use Unique Count with source.ip, for example in Kibana's Visualize -> Histogram, where I want the Y-axis to be a Unique Count of the field source.ip and the X-axis to be source.ip with a sub-bucket of feed name. The result should show which source.ip came to me from the most sources.
In general, unique count is problematic in my case.

Whenever I try it, Kibana shows:
"Error: Request Timeout after 30000ms
at http://kibana_URL/bundles/kibana.bundle.js?v=14723:14:8629
at http://kibana_URL/bundles/kibana.bundle.js?v=14723:14:9050"

This shows in logs (docker logs -f elasticsearch):
http://pastebin.com/Eejst5g3
And also this:
http://pastebin.com/QVLB9ERA

These are the settings of my index:
"settings" : {
  "index" : {
    "refresh_interval" : "5s",
    "number_of_shards" : "1",
    "provided_name" : "my_index",
    "number_of_replicas" : "0",
    "version" : {
      "created" : "5010299",
      "upgraded" : "5020299"
    }
  }
}

Health of the index:
"my_index" : {
  "status" : "green",
  "number_of_shards" : 1,
  "number_of_replicas" : 0,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
},

What should I try to fix this? Could anybody help or advise? I will provide additional information if needed.
Thx

The problem is caused by having many unique buckets (source IPs) and trying to count exactly how many things fall into each of these buckets. We can reduce the memory overhead by relaxing the need to count exactly.

The unique-count part of your query is executed under the covers by the cardinality aggregation [1]. By default it counts things accurately up to 3,000 unique values and then flips into a fuzzier (potentially inaccurate) mode that counts more values using less memory. Holding up to 3,000 unique values for each of the source.ip buckets can take a lot of memory, so we can reduce the memory cost, at the price of some accuracy, by lowering the precision_threshold setting to something significantly less than 3,000.
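For reference, the request Kibana builds for a visualization like yours looks roughly like the sketch below (field names taken from your example document; the terms size and precision_threshold value of 100 are just illustrative starting points, not recommendations):

```json
{
  "size": 0,
  "aggs": {
    "ips": {
      "terms": { "field": "source.ip", "size": 20 },
      "aggs": {
        "unique_feeds": {
          "cardinality": {
            "field": "feed.name",
            "precision_threshold": 100
          }
        }
      }
    }
  }
}
```

The memory pressure comes from multiplying the number of terms buckets by the per-bucket cost of each cardinality counter, so lowering either the terms size or the precision_threshold reduces the heap needed.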

I had a similar query (not source-ips -> num feeds but reviewers -> num sellers) which also ran out of memory using a visualization like yours. By changing the "advanced" setting on the Y-axis unique-count metric, as in this screenshot, I prevented the memory issue:
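In Kibana you can do this without writing the whole query yourself: expand the Y-axis metric in the visualization editor, open Advanced, and put the override into the JSON Input box. Kibana merges it into the generated cardinality aggregation. A minimal example (the value 100 is just a starting point to tune for your data):

```json
{ "precision_threshold": 100 }
```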

Tuning the numbers for your dataset will be walking a tricky line between getting accurate results and not running out of memory. Recognise that this is a challenging request to run in a single pass using aggregations on an event-centric index and sometimes entity-centric indexes make more sense when studying the behaviour of entities.

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

