Elastic cluster keeps crashing/hanging

Hi. First of all, let me state that I'm a developer who reads/writes docs to Elasticsearch, but I don't necessarily know all the ins and outs of Elasticsearch setup and management.
Secondly, I know it's a long post, so thanks for taking the time to read it... I'm out of ideas.

I have a 2 node cluster running Elasticsearch 1.7.3.
Each node has 24 cores, with 96 GB and 64 GB of RAM respectively and ~30 GB assigned to the JVM heap... they are connected back-to-back with 1 Gb copper.

I started experiencing increased volume on my application, and although I can still process and index the data, the cluster crashes or grinds to a halt as soon as I try to query anything (or simply open Kibana Discover).
I have dropped old indexes to bring my total indexes, shards and docs down to about 25% of what I had before the increased volume... I'm now sitting at 132 indexes, 327 shards and 1,300,000,000 docs (only 360 GB), but the problem persists.
I have 0 replicas and have played around with anything from 2 to 8 shards per index.
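For reference, this is roughly how I apply those shard/replica settings to the daily indexes, via an index template (just a sketch; the template name and host below are placeholders, and "flows-*" matches my daily index names):

curl -XPUT 'http://localhost:9200/_template/flows_settings' -d '
{
  "template": "flows-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}'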

The log file shows several CircuitBreakingException and IndexShardCreationException errors...

[2017-06-30 08:45:50,821][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:45:45.492Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
[2017-06-30 08:45:50,825][DEBUG][action.search.type ] [Lee Forrester] [flows-2017-06-30][3], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@461906db] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [White Pilgrim][inet[/100.100.100.1:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: [flows-2017-06-30][3]: query[ConstantScore(BooleanFilter(+QueryWrapperFilter(ConstantScore(:)) +cache(timestamp:[1498790735292 TO 1498805135292])))],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:163)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [REQUEST] Data too large, data for [<reused_arrays>] would be larger than limit of [21533831987/20gb]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:97)
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:148)
...
after which the node dies or just goes into a loop of:
[2017-06-30 08:45:55,861][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:45:50.831Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
[2017-06-30 08:46:00,914][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:45:55.879Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
[2017-06-30 08:46:05,966][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:46:00.934Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]

In a last-ditch effort I raised the breaker limits to:

indices.fielddata.cache.size: 75%
indices.breaker.fielddata.limit: 85%
indices.breaker.request.limit: 65%
indices.breaker.total.limit: 85%
but it still happens, just with a higher limit value.
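For what it's worth, I've been watching the breaker counters while this happens using the node stats API (the host is a placeholder); the [REQUEST] breaker is the one that trips:

curl -XGET 'http://localhost:9200/_nodes/stats/breaker?pretty'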

As I understand it, even though it's not the default setting, it would still help (in terms of heap usage) to store doc values, so I changed my mapping to store doc_values for all fields except analyzed strings.
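Concretely, the mapping change is applied through a template along these lines (a sketch only; the template name is made up, and the dynamic template simply maps new string fields as not_analyzed with doc_values, since analyzed strings can't use doc values on 1.7):

curl -XPUT 'http://localhost:9200/_template/flows_doc_values' -d '
{
  "template": "flows-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_as_doc_values": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      ]
    }
  }
}'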

Can you upgrade?

Not easily, as I had to compile some custom code into Kibana for my purposes, and since Kibana 4.1 is not compatible with Elastic Stack 5.x (nor are my 1.7 indexes, but those I can reindex), it will require significant development time; meanwhile my cluster has now essentially been down for almost 2 weeks.

However, if you tell me that 5.0 will fix my problem because X and/or Y has been addressed, then I can go that route.

P.S. Porting to 5.x has been on my to-do list, but it's not the highest priority just yet.

I won't promise that specifically :p, but you're running a version that is no longer supported. There are heaps of improvements around resource efficiency in 2.X and 5.X, so we always suggest upgrading.

To get to your problem though, you may want to add more resources to cope with the circuit breakers, i.e. more nodes.
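Before (or while) adding nodes, it can also help to confirm which node and which cache is actually under pressure, for example with something like (host is a placeholder):

curl -XGET 'http://localhost:9200/_cat/nodes?v'
curl -XGET 'http://localhost:9200/_cat/fielddata?v'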
