Elasticsearch is taking 2 hours to restart

Hello,

When restarting Elasticsearch I first get: SERVICE_UNAVAILABLE/1/state not recovered / initialized
Then, after a while, I get: Request timeout after 30000ms.

This whole process takes about 2 hours.

I have an EC2 instance with 30 GB of RAM, and about 500 GB of data in /var/lib.
I have daily indices (one per day since the 1st of January).

Is there a way to get this time down?

Thank you!

What version of Elasticsearch, OS, and JVM; and what plugins do you have installed?

Can you take a thread dump using jstack and post the results here so that we can see where shutdown is hanging?
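For reference, a minimal way to capture one, assuming Elasticsearch is the only JVM on the box and its main class is org.elasticsearch.bootstrap.Elasticsearch (adjust the pgrep pattern if not):

# Find the Elasticsearch PID and dump all thread stacks to a file.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstack "$ES_PID" > /tmp/es-threads.txt
# A second dump a few seconds later makes it easier to spot threads that are actually stuck.
sleep 5; jstack "$ES_PID" >> /tmp/es-threads.txt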

Elasticsearch 5.0.0-alpha1
OS: CentOS 7
I have no plugins installed other than the default ones.

jstack -m

Debugger attached successfully.
Server compiler detected.
JVM version is 25.65-b01
Deadlock Detection:

No deadlocks found.

......

----------------- 23271 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac51a4f13    _ZN2os5sleepEP6Threadlb + 0x283
0x00007fbac4fa6e22    JVM_Sleep + 0x3b2
0x00007fbab16cbb31    <Unknown compiled code>
0x00007fbab09ad98d    * org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run() bci:21 line:758 (Interpreted frame)
0x00007fbab09a64e7    <StubRoutines>
0x00007fbac4f13be6    _ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread + 0x1056
0x00007fbac4f140f1    _ZN9JavaCalls12call_virtualEP9JavaValue11KlassHandleP6SymbolS4_P17JavaCallArgumentsP6Thread + 0x321
0x00007fbac4f14597    _ZN9JavaCalls12call_virtualEP9JavaValue6Handle11KlassHandleP6SymbolS5_P6Thread + 0x47

...

0x00007fbac5958863    __GI_epoll_wait + 0x33
0x00007fbab0ae25f2    <Unknown compiled code>
----------------- 23311 -----------------

.....

0x00007fbac5958863    __GI_epoll_wait + 0x33
0x00007fbab09bb6d4    * sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) bci:0 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.EPollArrayWrapper.poll(long) bci:18 line:269 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.EPollSelectorImpl.doSelect(long) bci:28 line:79 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.SelectorImpl.lockAndDoSelect(long) bci:37 line:86 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.SelectorImpl.select(long) bci:30 line:97 (Interpreted frame)
0x00007fbab09ad3d0    * sun.nio.ch.SelectorImpl.select() bci:2 line:101 (Interpreted frame)
0x00007fbab09ad3d0    * org.jboss.netty.channel.socket.nio.NioServerBoss.select(java.nio.channels.Selector) bci:1 line:163 (Interpreted frame)
0x00007fbab09ad3d0    * org.jboss.netty.channel.socket.nio.AbstractNioSelector.run() bci:56 line:212 (Interpreted frame)
0x00007fbab09ad98d    * org.jboss.netty.channel.socket.nio.NioServerBoss.run() bci:1 line:42 (Interpreted frame)
0x00007fbab09ad9d2    * org.jboss.netty.util.ThreadRenamingRunnable.run() bci:55 line:108 (Interpreted frame)
0x00007fbab09ad9d2    * org.jboss.netty.util.internal.DeadLockProofWorker$1.run() bci:14 line:42 (Interpreted frame)
0x00007fbab09ad9d2    * ....
----------------- 23343 -----------------
0x00007fbab163da5a    <Unknown compiled code>
----------------- 23348 -----------------
0x00007fbac5958863    __GI_epoll_wait + 0x33
0x00007fbab0ae25f2    <Unknown compiled code>

...

----------------- 23365 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>

...

----------------- 26364 -----------------
0x00007fbac60486d5    __pthread_cond_wait + 0xc5
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>
0x00007fbab1f58f2c    * java.util.concurrent.locks.LockSupport.park(java.lang.Object) bci:14 line:175 (Compiled frame)
* java.util.concurrent.LinkedTransferQueue.awaitMatch(java.util.concurrent.LinkedTransferQueue$Node, java.util.concurrent.LinkedTransferQueue$Node, java.lang.Object, boolean, long) bci:184 line:737 (Interpreted frame)
0x00007fbab21c6d3c    * java.util.concurrent.LinkedTransferQueue.xfer(java.lang.Object, boolean, int, long) bci:286 line:647 (Compiled frame)
* java.util.concurrent.LinkedTransferQueue.take() bci:5 line:1269 (Compiled frame)
* java.util.concurrent.ThreadPoolExecutor.getTask() bci:149 line:1067 (Interpreted frame)
0x00007fbab09ad710    * java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
0x00007fbab09ad98d    * ...
----------------- 26374 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>
----------------- 26375 -----------------
..................................

----------------- 5096 -----------------
0x00007fbac6048a82    __pthread_cond_timedwait + 0x132
0x00007fbac5317355    Unsafe_Park + 0xf5
0x00007fbab0cf12ea    <Unknown compiled code>
----------------- 23234 -----------------
0x00007fbac6045ef7    pthread_join + 0xa7

Is this what you needed? (I don't understand much of this stack, and I couldn't post all of it since it's too big.)

Thank you!

Here are some other stats on the cluster that may help:

{
   "cluster_name": "elk-cluster",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 1,
   "number_of_data_nodes": 1,
   "active_primary_shards": 1451,
   "active_shards": 1451,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 205,
   "delayed_unassigned_shards": 0,
   "number_of_pending_tasks": 0,
   "number_of_in_flight_fetch": 0,
   "task_max_waiting_in_queue_millis": 0,
   "active_shards_percent_as_number": 87.6207729468599
}
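
In case it's useful, the 205 unassigned shards can be listed along with the reason Elasticsearch gives for each (assuming the default localhost:9200 endpoint):

# List unassigned shards and why they are unassigned.
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED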

Also, when the cluster is up, I see a lot of GC activity.
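
If it helps diagnose the GC, heap usage and collection counts are visible via the node stats API (again assuming localhost:9200):

# Show JVM heap usage and GC counts for the node.
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'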

Can you provide just the Java stack frames? There's no need for mixed mode (leave off the -m). Maybe paste it in a gist instead of trying to paste it here?

Of course.

While running: https://gist.github.com/veve90/6c26ddde3c29def42a8d922088611fd7
Right after restarting: https://gist.github.com/veve90/179f020d54b0a11c1b6f6ef5e91928f0

And the full ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];] error:

To be clear, you are talking about how long it takes Elasticsearch to recover after restarting, right?

Yes, I'm talking about how long Elasticsearch takes to go green (usable in Kibana) after I restart it.

From your thread dump, it looks like Elasticsearch is verifying index metadata (take a look at thread elasticsearch[elk][generic][T#2]). Do you have a large number of indices with large mappings? What are the specs of the machine that you're on? Can you currently reproduce this reliably; that is, do you still have the metadata from this cluster?
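
A rough way to gauge how large that metadata is, assuming the node listens on localhost:9200:

# Count indices and approximate the serialized size of the cluster metadata in bytes.
curl -s 'localhost:9200/_cat/indices?v' | wc -l
curl -s 'localhost:9200/_cluster/state/metadata' | wc -c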

Hello,

I have about 250 indices, but as far as I can tell I have too many shards; I'm trying to fix this.
I have about 20-30 mappings per event, and each index holds about 1-1.5 GB of data.

What do you mean by metadata?

These two things in combination are very likely what is causing your cluster to take so long to start up.

The index metadata, basically the index settings, aliases, and mappings.
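
For what it's worth, one common way to cut the shard count going forward is an index template so that new daily indices are created with a single shard. A sketch, assuming Logstash-style index names (the template name and pattern are illustrative, and zero replicas only makes sense on a single-node cluster like yours):

# Template applied to any new index whose name matches logstash-*.
curl -s -XPUT 'localhost:9200/_template/one_shard_per_day' -d '{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'

Existing indices keep their shard count, so the old daily indices would need to be combined into fewer, larger ones (for example with the _reindex API) to see the full benefit.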