Unexpected and unlogged hanging

  1. Cluster - 4 nodes: 2 × (master, no data) + 2 × (data, no master)
  2. Data - not much, ~10,000 small documents
  3. Uptime - ~1 month without restart; load - fewer than 10 queries/index tasks per second
  4. OS - Debian 8

Unexpectedly, the cluster hung on the _search endpoint.
Single-document GETs keep working, _cat still works, but any _search hangs and freezes.

_cat/nodes shows all nodes, and _cat/health is all green.
The data nodes' logs show no errors at all; the master node logs transport errors with the first data node:
[2015-07-11 19:29:27,253][DEBUG][action.search.type ] [web1] [ullogin][1], node[5EQm6ReJRK6j6VByOis2og], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@14f73de] lastShard [true]
org.elasticsearch.transport.SendRequestTransportException: [unlift1][inet[/]][indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)

Restarting the first data node does not resolve the problem, nor does restarting the no-data nodes.
The cluster becomes healthy again ONLY after restarting the second data node.

So it's a rather bad situation: there is no log info and no problem markers from the _cat commands, but the cluster is in fact unavailable (cannot _search).

So the only option for now is to watch the cluster manually and restart the data nodes if they become unhealthy again.
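The manual watch can be scripted as a small probe. A key detail from this incident is that a hung cluster still answers _cat and single-document GETs, so only a real _search call detects the failure. A minimal sketch, assuming the cluster is reachable at http://localhost:9200 (the URL, timeout, and alert text are placeholders, not anything from the cluster above):

```python
import urllib.request


def search_responds(url="http://localhost:9200/_search?size=0", timeout=5):
    """Return True if a trivial _search completes within `timeout` seconds.

    A hung cluster like the one described keeps answering _cat and GETs,
    so only an actual _search round-trip detects this failure mode.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:  # timeout, connection refused, HTTP error, ...
        return False


def pick_action(probe):
    """Decide what a watchdog loop should do, given a probe callable."""
    return "ok" if probe() else "alert: restart data node"
```

In a cron job, `pick_action(search_responds)` would run every minute and page an operator (or trigger the restart) when the probe times out.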

I once saw something similar, but with some plugins running. Do you have any plugins installed?

I have marvel, russian_morphology, a self-written tokenizer, and _update_by_query; and, as I now see, knapsack is installed too (but that was a week ago and it's not used).

Timing-wise, the problems started after the _update_by_query installation (it was the last one installed), but it is used rarely and only manually, so it would be very strange if it were the cause.

Another thing that changed recently is that I rewrote the mappings for some indexes and got some strange errors:
if I drop the mappings and then try to recreate them, it throws exceptions that it cannot find the custom tokenizer; if I completely drop the index, recreate it, set the mapping, and only after that set the "analyzer" options for the index, all is well (the mapping works and the tokenizer is used normally).
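The "cannot find custom tokenizer" error is what one would expect when a mapping references an analyzer before the analysis settings exist on the index. One way to sidestep the ordering problem entirely is to send the analysis settings and the mapping in a single create-index request. A sketch of such a request body, with placeholder names (`my_tokenizer`, `my_analyzer`, `doc`, `title` are illustrative, not the names from the cluster above):

```python
# Create-index body carrying both the custom analysis chain and the
# mapping that references it, so neither can exist without the other.
create_index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # placeholder for the self-written tokenizer mentioned above
                "my_tokenizer": {"type": "pattern", "pattern": r"\W+"}
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "doc": {  # ES 1.x-style mapping type
            "properties": {
                "title": {"type": "string", "analyzer": "my_analyzer"}
            }
        }
    },
}
```

Sent as the body of a `PUT /index-name` request, this guarantees the tokenizer is defined at the moment the mapping is parsed.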

I have investigated the Marvel reports more deeply.
Immediately before the hanging I see:

  1. a spike in searches per second
  2. all thread pools (search, index, warmer, and so on) dropping to ZERO

(And I wrote about the plugins before - it was not marked as a reply.)

I tried to check whether it's a thread-pool overload by running a load test, but that doesn't kill the cluster. So load by itself doesn't kill the cluster or cause the thread pools to drop.

It's very strange. This event occurs roughly once or twice every day. Next time I will check _cat/thread_pool.
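When that check happens, the _cat/thread_pool output can also be screened automatically for saturated queues. A minimal parser sketch; it assumes the output of `_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected` (that header set, and the threshold, are my assumptions, not something from this thread):

```python
def saturated_pools(cat_output, queue_threshold=100):
    """Return hosts whose search queue exceeds `queue_threshold`.

    Expects whitespace-separated _cat output with a header row, e.g. from
    _cat/thread_pool?v&h=host,search.active,search.queue,search.rejected
    """
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    flagged = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if int(row["search.queue"]) > queue_threshold:
            flagged.append(row["host"])
    return flagged
```

Fed from the watchdog cron job, this turns "check _cat/thread_pool next time" into an automatic alert on the node whose pools filled up.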

I have found the following:

[2015-07-16 14:07:10,756][DEBUG][action.search.type ] [unlift2] [140968] Failed to execute fetch phase
org.elasticsearch.script.groovy.GroovyScriptExecutionException: IOException[Cannot run program "/tmp/xudp": error=26, Text file busy]; nested: IOException[error=26, Text file busy];
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:278)
at org.elasticsearch.search.fetch.script.ScriptFieldsFetchSubPhase.hitExecute(ScriptFieldsFetchSubPhase.java:74)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:194)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:504)
at org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:452)
at org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:449)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

After that, all thread pools on DATA2 were full.
It may be some kind of attack, because after checking I found that port 9200 was open to the internet on the whole cluster (an administrator's mistake).
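A Groovy script trying to run a binary from /tmp on a cluster with 9200 exposed strongly suggests one of the known dynamic-scripting exploits against the 1.x line. Besides firewalling the port, a minimal elasticsearch.yml lockdown sketch using settings from that era (treat the exact keys as an assumption to verify against your version's documentation):

```yaml
# elasticsearch.yml - bind only to an internal interface,
# never to a public address
network.host: 127.0.0.1   # or the private cluster interface

# disable dynamic (request-supplied) scripting, the usual vector
# for "run a binary from /tmp" payloads on 1.x clusters
script.disable_dynamic: true
```

With dynamic scripting off, script fields in _search requests are rejected, which would also stop the GroovyScriptExecutionException seen in the fetch phase above.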