Unexpected and unlogged hanging

  1. Cluster - 4 nodes: 2 × (master, no data) + 2 × (data, no master)
  2. Data - not much, ~10,000 small documents
  3. Uptime - about a month without a restart; load - fewer than 10 queries / index tasks per second
  4. OS - Debian 8

Problem: the cluster unexpectedly hangs on the _search endpoint.
Getting documents by ID still works, _cat requests still work, but any _search request hangs and freezes.

_cat/nodes shows all the nodes and _cat/health reports all green.
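For reference, these are the checks I mean (assuming a node reachable on localhost:9200):

```
curl -s 'http://localhost:9200/_cat/nodes?v'
curl -s 'http://localhost:9200/_cat/health?v'
```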
The data nodes' logs contain no errors at all; the master node logs transport errors against the first data node:
```
[2015-07-11 19:29:27,253][DEBUG][action.search.type ] [web1] [ullogin][1], node[5EQm6ReJRK6j6VByOis2og], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@14f73de] lastShard [true]
org.elasticsearch.transport.SendRequestTransportException: [unlift1][inet[/192.168.1.3:9300]][indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
```

Restarting the first data node does not resolve the problem, and neither does restarting the no-data nodes.
The cluster becomes usable again ONLY after restarting the second data node.

So it's a rather bad situation: there is no log info and no problem indicator in the _cat output, yet the cluster is effectively unavailable (cannot _search).

So the only option for now is to watch the cluster manually and restart the data nodes whenever they go bad again.
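For now I just poll it with something like this (host, port and the 10 second timeout are only placeholders for my setup):

```
# Probe the _search endpoint with a timeout: a hung cluster trips the timeout
# even while _cat/health still reports green.
curl -sf -m 10 'http://localhost:9200/_search?size=0' > /dev/null \
  || echo "[$(date)] _search probe failed or timed out - data nodes may need a restart"
```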

I once saw something similar, but with some plugins running. Do you have any plugins installed?

I have marvel, russian_morphology, a self-written tokenizer, and _update_by_query, and as I now see, knapsack is installed too (but that was a week ago and it isn't used).

Timing-wise, the problems started after the _update_by_query plugin was installed (it was the last one installed), but it is used rarely and only manually, so it would be very strange if it were the cause.

Another thing that happened not long before: I rewrote the mappings for some indexes and got some strange errors.
If I drop the mappings and then try to recreate them, it throws exceptions saying it cannot find the custom tokenizer; if I completely drop the index, recreate it, set the mapping, and only after that set the "analyzer" options for the index, then everything is fine (the mapping works and the tokenizer is used normally). A sketch of that recreate-from-scratch approach is below.
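For illustration only; here I put the analysis settings into the index creation request so the custom tokenizer already exists when the mapping is applied. The index, type, field, analyzer and tokenizer names are placeholders (the real tokenizer comes from our own plugin):

```
# Recreate the index from scratch with the analysis settings supplied up front,
# then apply the mapping that references the custom analyzer.
curl -XDELETE 'http://localhost:9200/myindex'

curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_custom_tokenizer"
        }
      }
    }
  }
}'

curl -XPUT 'http://localhost:9200/myindex/_mapping/mytype' -d '{
  "mytype": {
    "properties": {
      "text": { "type": "string", "analyzer": "my_custom_analyzer" }
    }
  }
}'
```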

I have investigated the Marvel reports more deeply.
Immediately before the hang I see:

  1. a spike in searches per second
  2. all thread pools (search, index, warmer and so on) dropping to ZERO

(I already wrote about the plugins above; it was not marked as a reply.)

I tried to check whether it's a thread-pool overload by running a load test, but the test does not kill the cluster. So load by itself does not kill the cluster or make the thread pools drop.

It's very strange. This event occurs roughly once or twice a day. Next time I will check _cat/thread_pool.
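Something like this, to watch the active / queued / rejected counts per node (the exact column list is an assumption based on the _cat docs for this version):

```
# Per-node thread pool state for the search and index pools.
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected,index.active,index.queue,index.rejected'
```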

I have found the following:

```
[2015-07-16 14:07:10,756][DEBUG][action.search.type ] [unlift2] [140968] Failed to execute fetch phase
org.elasticsearch.script.groovy.GroovyScriptExecutionException: IOException[Cannot run program "/tmp/xudp": error=26, Text file busy]; nested: IOException[error=26, Text file busy];
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:278)
at org.elasticsearch.search.fetch.script.ScriptFieldsFetchSubPhase.hitExecute(ScriptFieldsFetchSubPhase.java:74)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:194)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:504)
at org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:452)
at org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:449)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
```

After that, all the thread pools on the second data node were full.
It may be some kind of attack, because after checking I found that port 9200 was open to the internet on the whole cluster (an administrator's mistake).
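As a first measure I'm closing 9200 to the outside, roughly like below (the subnet is only an example for our internal network); binding HTTP to the internal interface via network.host in elasticsearch.yml, and disabling dynamic Groovy scripting if this 1.x version still allows it, is probably worth doing too.

```
# Example iptables rules (placeholder subnet): allow 9200 only from the internal
# network and drop everything else; apply on every node that exposes HTTP.
iptables -A INPUT -p tcp --dport 9200 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 9200 -j DROP
```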