Elasticsearch is taking 2 hours to restart


When restarting Elasticsearch I first get: SERVICE_UNAVAILABLE/1/state not recovered / initialized
Then, after a while, I get: Request timeout after 30000ms.

This whole process takes about 2 hours.

I have an EC2 instance with 30 GB of RAM, and about 500 GB of data in /var/lib.
I have one index per date (daily indices since the 1st of January).

Is there a way to get this time down?

Thank you!

What version of Elasticsearch, OS, and JVM; and what plugins do you have installed?

Can you take a thread dump using jstack and post the results here so that we can see where shutdown is hanging?
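In case it's useful, here is one way to grab that dump; matching on the bootstrap main class is an assumption, so adjust it for how your node is launched:

```shell
# Find the Elasticsearch PID and write a thread dump to a file
# (matching on the bootstrap class is an assumption; "jps -l" also lists Java PIDs)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n 1)
jstack "$ES_PID" > /tmp/es-threads.txt
```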

Elasticsearch 5.0.0-alpha1
OS: CentOS 7
I have no plugins installed, other than the default ones.

jstack -m

Debugger attached successfully.
Server compiler detected.
JVM version is 25.65-b01
Deadlock Detection:

No deadlocks found.


----------------- 23271 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac51a4f13    _ZN2os5sleepEP6Threadlb + 0x283
0x00007fbac4fa6e22    JVM_Sleep + 0x3b2
0x00007fbab16cbb31    <Unknown compiled code>
0x00007fbab09ad98d    * org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run() bci:21 line:758 (Interpreted frame)
0x00007fbab09a64e7    <StubRoutines>
0x00007fbac4f13be6    _ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread + 0x1056
0x00007fbac4f140f1    _ZN9JavaCalls12call_virtualEP9JavaValue11KlassHandleP6SymbolS4_P17JavaCallArgumentsP6Thread + 0x321
0x00007fbac4f14597    _ZN9JavaCalls12call_virtualEP9JavaValue6Handle11KlassHandleP6SymbolS5_P6Thread + 0x47


0x00007fbac5958863    __GI_epoll_wait + 0x33
0x00007fbab0ae25f2    <Unknown compiled code>
----------------- 23311 -----------------


0x00007fbac5958863    __GI_epoll_wait + 0x33
0x00007fbab09bb6d4    * sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) bci:0 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.EPollArrayWrapper.poll(long) bci:18 line:269 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.EPollSelectorImpl.doSelect(long) bci:28 line:79 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.SelectorImpl.lockAndDoSelect(long) bci:37 line:86 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.SelectorImpl.select(long) bci:30 line:97 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.SelectorImpl.select() bci:2 line:101 (Interpreted frame)
0x00007fbab09ad3d0    * org.jboss.netty.channel.socket.nio.NioServerBoss.select(java.nio.channels.Selector) bci:1 line:163 (Interpreted frame)
0x00007fbab09ad3d0    * org.jboss.netty.channel.socket.nio.AbstractNioSelector.run() bci:56 line:212 (Interpreted frame)
0x00007fbab09ad98d    * org.jboss.netty.channel.socket.nio.NioServerBoss.run() bci:1 line:42 (Interpreted frame)
0x00007fbab09ad9d2    * org.jboss.netty.util.ThreadRenamingRunnable.run() bci:55 line:108 (Interpreted frame)
0x00007fbab09ad9d2    * org.jboss.netty.util.internal.DeadLockProofWorker$1.run() bci:14 line:42 (Interpreted frame)
0x00007fbab09ad9d2    * ....
----------------- 23343 -----------------
0x00007fbab163da5a    <Unknown compiled code>
----------------- 23348 -----------------
0x00007fbac5958863    __GI_epoll_wait + 0x33
0x00007fbab0ae25f2    <Unknown compiled code>


----------------- 23365 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>


----------------- 26364 -----------------
0x00007fbac60486d5    __pthread_cond_wait + 0xc5
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>
0x00007fbab1f58f2c    * java.util.concurrent.locks.LockSupport.park(java.lang.Object) bci:14 line:175 (Compiled frame)
* java.util.concurrent.LinkedTransferQueue.awaitMatch(java.util.concurrent.LinkedTransferQueue$Node, java.util.concurrent.LinkedTransferQueue$Node, java.lang.Object, boolean, long) bci:184 line:737 (Interpreted frame)
0x00007fbab21c6d3c    * java.util.concurrent.LinkedTransferQueue.xfer(java.lang.Object, boolean, int, long) bci:286 line:647 (Compiled frame)
* java.util.concurrent.LinkedTransferQueue.take() bci:5 line:1269 (Compiled frame)
* java.util.concurrent.ThreadPoolExecutor.getTask() bci:149 line:1067 (Interpreted frame)
0x00007fbab09ad710    * java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
0x00007fbab09ad98d    * ...
----------------- 26374 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>
----------------- 26375 -----------------

----------------- 5096 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>
----------------- 23234 -----------------
0x00007fbac6045ef7    pthread_join + 0xa7

Is this what you needed? (I don't understand a lot of this stack, and I couldn't post all of it since it is too big.)

Thank you!

Here are some other stats on the cluster that may help:

{
   "cluster_name": "elk-cluster",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 1,
   "number_of_data_nodes": 1,
   "active_primary_shards": 1451,
   "active_shards": 1451,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 205,
   "delayed_unassigned_shards": 0,
   "number_of_pending_tasks": 0,
   "number_of_in_flight_fetch": 0,
   "task_max_waiting_in_queue_millis": 0,
   "active_shards_percent_as_number": 87.6207729468599
}
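As a sanity check on the numbers above, the active-shards percentage is simply active shards over active plus unassigned shards:

```shell
# active_shards_percent_as_number = active / (active + unassigned) * 100
awk 'BEGIN { active = 1451; unassigned = 205; printf "%.13f\n", active / (active + unassigned) * 100 }'
# prints 87.6207729468599
```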

Also, when the cluster is up, I see a lot of GC activity.

Can you just provide the Java stack frames, no need for the mixed mode (leave off the -m)? Maybe just paste it in a gist instead of trying to paste it here?


While running: https://gist.github.com/veve90/6c26ddde3c29def42a8d922088611fd7
When just restarted: https://gist.github.com/veve90/179f020d54b0a11c1b6f6ef5e91928f0

And the full ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];] error:

To be clear, you are talking about how long it takes Elasticsearch to recover after restarting, right?

Yes, I'm talking about how long Elasticsearch takes to go green (usable in Kibana) after I restart it.

From your thread dump, it looks like Elasticsearch is verifying index metadata (take a look at thread elasticsearch[elk][generic][T#2]). Do you have a large number of indices with large mappings? What are the specs of the machine that you're on? Can you currently reproduce this reliably; that is, do you still have the metadata from this cluster?


I have about 250 indices, but as far as I can see I have too many shards; I'm trying to fix this.
I have about 20-30 mappings per event, and each index holds about 1-1.5 GB of data.
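For what it's worth, a common way to cap the shard count for future daily indices is an index template. A sketch, where the template name and the "logstash-*" pattern are assumptions to adjust for your naming (the "template" key is the 5.x form of the setting):

```shell
# Cap shards for indices created from now on via an index template
# (template name and "logstash-*" pattern are assumptions; adjust to your index naming)
curl -s -XPUT 'localhost:9200/_template/daily-logs' \
  -H 'Content-Type: application/json' -d '{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
```

On a single-node cluster, replica shards can never be assigned anyway, which is likely where the 205 unassigned shards in the health output come from.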

What do you mean by metadata?

These two things in combination are very likely what is causing your cluster to take so long to start up.

The index metadata: basically the index settings, aliases, and mappings.
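For reference, one way to see what that metadata looks like for a single index, and to roughly gauge the size of the full cluster metadata the node has to load (the index name below is just an example):

```shell
# Settings, aliases, and mappings for one index (index name is an example)
curl -s 'localhost:9200/logstash-2016.01.01?pretty'

# Rough byte size of the whole cluster metadata
curl -s 'localhost:9200/_cluster/state/metadata' | wc -c
```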