Hi. First of all, let me state that I'm a developer who reads and writes documents to Elasticsearch, but I don't necessarily know all the ins and outs of Elasticsearch setup and management.
Secondly, I know it's a long post; thanks for taking the time to read it... I'm out of ideas.
I have a 2-node cluster running Elasticsearch 1.7.3.
Each node has 24 cores, with 96 GB and 64 GB of RAM respectively and ~30 GB assigned to the JVM heap... the nodes are connected back-to-back over 1 Gb copper.
I started experiencing increased volume in my application, and although I can still process and index the data, the cluster crashes or grinds to a halt as soon as I try to query anything (or simply open Kibana Discover).
I have dropped old indexes to bring my total indexes, shards and docs down to about 25% of what I had before the increased volume... now sitting at 132 indexes, 327 shards and 1 300 000 000 docs (only 360 GB) - but the problem persists.
I have 0 replicas and have played around with anything from 2 to 8 shards per index.
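To give an idea of the layout, the daily flow indexes end up with settings equivalent to something like this (the shard count shown is just one of the values I've tried), and I check the index/shard/doc counts with the _cat APIs:

curl -XPUT 'http://localhost:9200/flows-2017-06-30' -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 0
  }
}'

curl 'http://localhost:9200/_cat/indices?v'
curl 'http://localhost:9200/_cat/shards?v'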
The log file shows several CircuitBreakingException and IndexShardCreationException entries...
[2017-06-30 08:45:50,821][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:45:45.492Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
[2017-06-30 08:45:50,825][DEBUG][action.search.type ] [Lee Forrester] [flows-2017-06-30][3], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@461906db] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [White Pilgrim][inet[/100.100.100.1:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: [flows-2017-06-30][3]: query[ConstantScore(BooleanFilter(+QueryWrapperFilter(ConstantScore(:)) +cache(timestamp:[1498790735292 TO 1498805135292])))],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:163)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [REQUEST] Data too large, data for [<reused_arrays>] would be larger than limit of [21533831987/20gb]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:97)
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:148)
...
After this the node either dies or just goes into a loop of:
[2017-06-30 08:45:55,861][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:45:50.831Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
[2017-06-30 08:46:00,914][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:45:55.879Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
[2017-06-30 08:46:05,966][WARN ][cluster.action.shard ] [Lee Forrester] [flows-2017-06-29][6] received shard failed for [flows-2017-06-29][6], node[UQCJuJFMQoWOW6H-y67DSw], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-30T06:46:00.934Z], details[shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]], indexUUID [zLqHbRguSbSSN8EG9qbGCA], reason [shard failure [failed to create shard][IndexShardCreationException[[flows-2017-06-29][6] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [flows-2017-06-29][6], timed out after 5000ms]; ]]
In a last-ditch effort I raised the fielddata cache and circuit breaker limits to:
indices.fielddata.cache.size: 75%
indices.breaker.fielddata.limit: 85%
indices.breaker.request.limit: 65%
indices.breaker.total.limit: 85%
but it still happens, just with a higher limit value in the exception.
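If it's relevant: indices.fielddata.cache.size is (as far as I can tell) a static node setting, so that one went into elasticsearch.yml with a restart, but the three breaker limits are dynamic, so they could equally be applied to the running cluster with something like:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.breaker.fielddata.limit": "85%",
    "indices.breaker.request.limit": "65%",
    "indices.breaker.total.limit": "85%"
  }
}'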
As I understand it, even though it's not the default setting, it would still help (in terms of heap usage) to store doc values - so I changed my mappings to enable doc_values for everything except analyzed strings.
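To illustrate (only timestamp is a real field name from my data; the other field names and the type name are placeholders), the index template mapping now looks something like this, with doc_values enabled on the not_analyzed / numeric / date fields:

curl -XPUT 'http://localhost:9200/_template/flows' -d '{
  "template": "flows-*",
  "mappings": {
    "flow": {
      "properties": {
        "timestamp": { "type": "date", "doc_values": true },
        "src_ip": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "bytes": { "type": "long", "doc_values": true },
        "message": { "type": "string", "index": "analyzed" }
      }
    }
  }
}'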