A few general questions about Elasticsearch

I've been curious about a few aspects of Elasticsearch and was hoping to get some answers:

  1. What is the best way to deal with queries that bring down the entire cluster by causing Elasticsearch to run out of memory? I have a cluster of three master-eligible nodes and it goes down for this reason sometimes. Is this a reason to have a dedicated master node?

  2. The docs for Elasticsearch talk about the benefits of clustering, but they mostly present it as a solution for when a node goes offline for some reason. However, my cluster has gone down pretty much just from running out of memory. Because the cluster distributes queries to every node, the whole thing goes down instead of just one or two nodes. So does the clustering benefit only apply to other types of issues?

  3. Are there benefits to a distributed architecture other than providing a solution for when a node goes down? I know it lets you scale horizontally as well, but does it also allow better performance in some ways?

  4. When my cluster goes down from running out of memory, I have been SSHing into each node and restarting the service. That works fine, but I'm wondering if there is a better way; it would be nice to do it all at once (something like the sketch after this list is what I have in mind).
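
For question 4, this is roughly what I imagine scripting instead of doing it by hand (host names are made up; I'm assuming systemd and SSH access to each node from one machine):

    # Hypothetical restart-everything loop; host names are placeholders.
    # Assumes systemd (Ubuntu 16) and that ssh/sudo are already set up.
    for host in es-node1 es-node2 es-node3; do
        ssh "$host" 'sudo systemctl restart elasticsearch'
    done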

Any help would be greatly appreciated. Thank you!

What are your queries? Are you getting OOMs in the logs? Can you share the logs? What version are you on? What heap size? Java version, OS? What are you using to monitor things?

They were fairly large queries. One, for example, was an aggregation of unique terms with a cardinality of about 1.3 million. I had previously broken it up using partitions and had my client code cycle through them and do the math afterwards, but it would be nice to just get it all at once.
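
The partitioned version looked roughly like this (index name, field name and the numbers are placeholders, not the real query), with the client cycling "partition" from 0 to 19 and combining the results afterwards:

    # Rough shape only; the index/field names and sizes are made up.
    curl -s -H 'Content-Type: application/json' \
      'http://localhost:9200/my-index/_search?size=0' -d '{
      "aggs": {
        "unique_terms": {
          "terms": {
            "field": "some_field",
            "size": 65000,
            "include": { "partition": 0, "num_partitions": 20 }
          }
        }
      }
    }'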

I don't have the logs in front of me, but when the cluster goes down I SSH into a node and the Elasticsearch logs show a Java heap space error. I'm on Elasticsearch 6.2, JDK 1.8, Ubuntu 16, with 4GB of RAM per node and the Elasticsearch heap locked at 2GB. I know that's under the recommended memory size, but I can't currently scale up at all for... reasons.
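
For what it's worth, the heap and lock settings are just the standard ones (paths assume the Ubuntu/Debian package layout):

    # /etc/elasticsearch/jvm.options
    -Xms2g
    -Xmx2g

    # /etc/elasticsearch/elasticsearch.yml
    bootstrap.memory_lock: true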

We're mainly just beta testing now, so the monitoring is pretty much "hey, my query didn't work... yep, Elasticsearch is down."

You probably already know this, but a 2GB heap and aggregations over ~1.3 million unique terms won't play well together.

It'd be super handy if you had logs available, as the circuit breakers should have stopped this. My guess is that it's OOMing before it even gets that far, while it's trying to figure out whether the request will trip the breakers. Which is a catch-22, and only a theory.

Yeah, I could guess =) Not up to me how much memory we have though.

Here are some example logs:

[2018-02-19T14:19:34,689][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch1] [gc][old][208923][2234] duration [7s], collections [1]/[7.1s], total [7s]/[25.8m], memory [1.9gb]->[1.9gb]/[1.9gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [14.7mb]->[14.4mb]/[16.6mb]}{[old] [1.8gb]->[1.8gb]/[1.8gb]}
[2018-02-19T14:19:34,689][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch1] [gc][208923] overhead, spent [7s] collecting in the last [7.1s]
[2018-02-19T14:19:39,721][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch1] [gc][208924] overhead, spent [4.9s] collecting in the last [5s]
[2018-02-19T14:19:39,722][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elasticsearch1] fatal error in thread [elasticsearch[elasticsearch1][management][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
[2018-02-19T14:19:34,713][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [elasticsearch1] fatal error in thread [Thread-73902], exiting
java.lang.OutOfMemoryError: Java heap space
	at io.netty.buffer.PooledHeapByteBuf$1.newObject(PooledHeapByteBuf.java:34) ~[?:?]
	at io.netty.buffer.PooledHeapByteBuf$1.newObject(PooledHeapByteBuf.java:31) ~[?:?]
	at io.netty.util.Recycler.get(Recycler.java:148) ~[?:?]
	at io.netty.buffer.PooledHeapByteBuf.newInstance(PooledHeapByteBuf.java:39) ~[?:?]
	at io.netty.buffer.PoolArena$HeapArena.newByteBuf(PoolArena.java:702) ~[?:?]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:145) ~[?:?]
	at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:307) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:162) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:153) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:135) ~[?:?]
	at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

Ok, so high GC and then OOM.

Can you share the query?

I'm not sure if I can, but it is a date histogram aggregation over one month, split into weeks, with a sub-aggregation per week of unique IP addresses. I'm still a relative noob to Elasticsearch and "big data", so I've been wondering whether I'm correct in thinking this is a large query, since "large" is relative. Is this something a normal Elasticsearch cluster should be able to handle, given an IP address cardinality of about 1.5 million per month? Or is it normal for most organizations to have to split up their aggregations into partitions for this kind of thing?
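
Roughly, the shape of it is something like this (index and field names are changed):

    # Shape of the query only; index/field names are not the real ones.
    curl -s -H 'Content-Type: application/json' \
      'http://localhost:9200/my-index/_search?size=0' -d '{
      "query": { "range": { "@timestamp": { "gte": "now-1M/M", "lt": "now/M" } } },
      "aggs": {
        "per_week": {
          "date_histogram": { "field": "@timestamp", "interval": "week" },
          "aggs": {
            "unique_ips": { "cardinality": { "field": "client_ip" } }
          }
        }
      }
    }'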

Also, I do totally understand that the query is too memory intensive for the hardware our cluster is running on. I'm just more curious about Elasticsearch itself: why does the cluster go down instead of the circuit breakers stopping the query?

If you'd like the specific query, I can send it to you via PM if that's okay.

From reading more in the docs:

Consider the performance implications of multiple tenants, a weakness or a bad query in one can bring down an entire cluster!

So I take this to mean that there aren't actually safeguards in place to prevent a cluster from going down because of an intensive query?

Bump. Mainly just trying to understand if a circuit breaker (I'm not familiar with these) should prevent a cluster from going down because of a heavy query causing OOM errors.

There are: circuit breakers.

However, if they're not kicking in, perhaps you can raise a GitHub issue with as much info as possible so the team can dig into it.

How much data do you have in the cluster? How many indices/shards? What is the output of the cluster stats API?
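
That is, something along the lines of the following (assuming you can reach the cluster on localhost:9200):

    # Indices/shards overview and overall cluster stats.
    curl -s 'http://localhost:9200/_cat/indices?v'
    curl -s 'http://localhost:9200/_cluster/stats?human&pretty'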

Cardinality aggregations can be memory hungry. Can you share what your query/dashboard looks like?

Do these stats affect whether the circuit breaker works or not? I totally understand our cluster is underpowered and that the queries we ran brought it down because of that. I'm just curious why the circuit breaker didn't stop it from going down though.

The stats would help show how much heap space is taken up by overhead. If that is high, I suspect you might need to tune the circuit breaker settings.
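
The request and total breakers are dynamic cluster settings, so as a rough sketch only (the percentages below are examples, not a recommendation for your sizing) you could lower them with something like:

    # Sketch only; the percentages are illustrative, not tuned values.
    curl -s -X PUT -H 'Content-Type: application/json' \
      'http://localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "indices.breaker.total.limit": "60%",
        "indices.breaker.request.limit": "40%"
      }
    }'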

So we have 3 servers with 4GB each, with the Elasticsearch heap locked at 2GB (min and max). The queries vary, but they are all fairly large queries with aggregations and high cardinality (IP addresses, for example, usually around 1.5 million).

How can I tune the circuit breaker settings so that I can be sure they'll cut in and prevent the cluster from going down? Is there any math I can do to figure out what a good setting for this is?

Thank you.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.