Elasticsearch takes minutes to start

Abid_Hussain · May 2, 2017, 2:21pm

Hi all,

I downloaded elasticsearch 5.3.2 and changed some settings. There is no data indexed. Still it takes about 5 minutes until the first log message appears.

Here are the settings I have changed in my config file:
cluster.name: my-application
node.name: node-1
http.port: 29200
transport.tcp.port: 29300-29400

OS is SUSE Linux Enterprise Server 11.

Any hints what's going on?

nik9000 · May 2, 2017, 2:41pm

I have no idea. I'd hit the process with jstat -gcutil <pid> 3s 100 to see if it is gcing (would be weird, but easy to check). If the FGC column goes up a bunch then that tells us that it is GCing. Otherwise I'd hit it with jstack <pid> | tee elasticsearch_stack and have a look at the output and see if you have something weird going on.

Abid_Hussain · May 2, 2017, 3:36pm

Running jstack show two threads in WAITING state (all others are in state RUNNING):

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007ffa68194800 nid=0x197 in Object.wait() [0x00007ffa61ac8000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000080008e98> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
- locked <0x0000000080008e98> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007ffa68190000 nid=0x196 in Object.wait() [0x00007ffa61bc9000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000080006b40> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0x0000000080006b40> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

Running jstat show the following (I only pasted the output starting seconds before the first log message appears):

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0,00   0,00  20,00   0,00  17,28  19,76      0    0,000     0    0,000    0,000
  0,00   0,00  20,00   0,00  17,28  19,76      0    0,000     0    0,000    0,000
  0,00   0,00  26,01   0,00  17,28  19,76      0    0,000     0    0,000    0,000
  0,00   0,00  98,28   0,00  17,28  19,76      0    0,000     1    0,050    0,050
  0,00  59,15  43,79   0,00  94,21  87,26      1    0,046     1    0,050    0,096
  0,00  59,15  58,47   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  79,86   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  81,20   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  91,41   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  98,60   0,00  94,21  87,26      1    0,046     2    0,137    0,183
100,00   0,00  13,30   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  21,83   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  22,12   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  24,64   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  32,91   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  33,13   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  34,01   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  38,46   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  43,05   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  46,37   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  48,38   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  56,33   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  56,48   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  58,67   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  66,76   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  66,98   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  68,60   6,97  93,00  86,25      2    0,290     2    0,137    0,427

I'm not that familiar with jstat but to me the output seems not to point to any unusual behaviour...?

nik9000 · May 2, 2017, 3:44pm

jstack looks good. I think the running threads are going to be important though. Can you make a gist of that?

Abid_Hussain · May 2, 2017, 3:49pm

See https://gist.github.com/abid76/32c453f351967fae6744e2a284a6a35e

Regards,

Abid

nik9000 · May 2, 2017, 4:08pm

I wonder if this is a slow NFS mount like in

If so, you can work around it by dropping the NFS mounts from the machine.

Abid_Hussain · May 3, 2017, 11:46am

Elasticsearch is not installed on an NFS mount.

I wonder if it has something to do with the node discovery on startup.

As we use ES as single node installation, how can I disable node discovery?

In ES 1.4 (which we are currently using in our productive environment) this is done by setting

discovery.zen.ping.multicast.enabled: false

Abid_Hussain · May 3, 2017, 2:37pm

Any suggestions how to configure a single node setup? Maybe this would help ...

Additionally, here is the log after startup.

[2017-05-03T15:35:31,624][INFO ][o.e.n.Node               ] [node-1] initializing ...
[2017-05-03T15:35:31,807][INFO ][o.e.e.NodeEnvironment    ] [node-1] using [1] data paths, mounts [[/ (/dev/mapper/system-root)]], net usable_space [1tb], net total_space [1.9tb], spins? [possibly], types [ext3]
[2017-05-03T15:35:31,808][INFO ][o.e.e.NodeEnvironment    ] [node-1] heap size [1.9gb], compressed ordinary object pointers [true]
[2017-05-03T15:35:31,842][INFO ][o.e.n.Node               ] [node-1] node name [node-1], node ID [VNBeHdQMTyeQ-af4JTdlTg]
[2017-05-03T15:35:31,843][INFO ][o.e.n.Node               ] [node-1] version[5.3.2], pid[9978], build[3068195/2017-04-24T16:15:59.481Z], OS[Linux/3.0.101-80-xen/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_102/25.102-b14]
[2017-05-03T15:35:33,531][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [aggs-matrix-stats]
[2017-05-03T15:35:33,531][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [ingest-common]
[2017-05-03T15:35:33,531][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-expression]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-groovy]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-mustache]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-painless]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [percolator]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [reindex]
[2017-05-03T15:35:33,534][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [transport-netty3]
[2017-05-03T15:35:33,534][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [transport-netty4]
[2017-05-03T15:35:33,534][INFO ][o.e.p.PluginsService     ] [node-1] loaded plugin [x-pack]
[2017-05-03T15:35:36,079][DEBUG][o.e.a.ActionModule       ] Using REST wrapper from plugin org.elasticsearch.xpack.XPackPlugin
[2017-05-03T15:35:37,979][INFO ][o.e.n.Node               ] [node-1] initialized
[2017-05-03T15:35:37,979][INFO ][o.e.n.Node               ] [node-1] starting ...
[2017-05-03T15:35:38,328][INFO ][o.e.t.TransportService   ] [node-1] publish_address {127.0.0.1:29300}, bound_addresses {[::1]:29300}, {127.0.0.1:29300}, {127.0.0.2:29300}
[2017-05-03T15:35:38,348][WARN ][o.e.b.BootstrapChecks    ] [node-1] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2017-05-03T15:35:38,348][WARN ][o.e.b.BootstrapChecks    ] [node-1] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2017-05-03T15:35:41,450][INFO ][o.e.c.s.ClusterService   ] [node-1] new_master {node-1}{VNBeHdQMTyeQ-af4JTdlTg}{QPQmD9ddTmKnRauNaQa1ww}{127.0.0.1}{127.0.0.1:29300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-05-03T15:35:41,512][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [node-1] publish_address {127.0.0.1:29200}, bound_addresses {[::1]:29200}, {127.0.0.1:29200}, {127.0.0.2:29200}
[2017-05-03T15:35:41,519][INFO ][o.e.n.Node               ] [node-1] started
[2017-05-03T15:35:42,323][INFO ][o.e.l.LicenseService     ] [node-1] license [dc1c9071-33f0-4868-925f-3e1c366fd734] mode [basic] - valid
[2017-05-03T15:35:42,352][INFO ][o.e.g.GatewayService     ] [node-1] recovered [5] indices into cluster_state
[2017-05-03T15:35:43,091][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.monitoring-es-2-2017.05.02][0]] ...]).

jasontedor · May 3, 2017, 2:49pm

The jstack shows a file store issue, exactly what the PR that @nik9000 linked to is aimed at getting around.

This can happen, but then I would expect to see DNS calls in the jstack output which we do not see here. Also, you said that the stall is before the first log message appears, discovery is not loaded until well after this.

In 5.4.0 (not yet released) you will be able to set discovery.type to single-node but I do not think that that is your problem here.

Abid_Hussain · May 3, 2017, 3:54pm

The parent folder of ES installation is mounted as network drive (not NFS) into Windows. But after moving it to a different folder which is not mounted as network drive still it takes very long to start.

I have actually no idea what could be the reason.

Could you please point to me the lines in jstack output where the issue shows up? Is it this one (at line 79)?

"main" #1 prio=5 os_prio=0 tid=0x000000000060e800 nid=0x590 runnable [0x00007f51fbe08000]
   java.lang.Thread.State: RUNNABLE
  at sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)
  at sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)
  at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)
  at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:55)
  at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:70)
  at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:48)
  at sun.nio.fs.LinuxFileSystem.getFileStore(LinuxFileSystem.java:112)
  at sun.nio.fs.UnixFileSystem$FileStoreIterator.readNext(UnixFileSystem.java:213)
  at sun.nio.fs.UnixFileSystem$FileStoreIterator.hasNext(UnixFileSystem.java:224)
  - locked <0x0000000085b3f820> (a sun.nio.fs.UnixFileSystem$FileStoreIterator)
  at org.apache.lucene.util.IOUtils.getFileStore(IOUtils.java:543)
  ....

jasontedor · May 3, 2017, 4:09pm

Yes, it's exactly the stack trace that you show. Today we look at all file stores, it doesn't matter if Elasticsearch is using them or not. The PR that was linked to will remove this, we will only look at the file stores that Elasticsearch is using. I think this is exactly your problem.

Abid_Hussain · May 4, 2017, 3:06pm

When I got it right, startup takes such a long time because somewhere in the file system there's mounted a slow windows share (this is the case on the system on which I run ES). Even if it is a folder which ES never will use.

Sounds weird to me. We never had such a problem using ES 1.4. (Our aim is to migrate from ES 1.4 to current version).

Will this be fixed in the near future? Otherwise I'll go on trying ES 2.x.

jasontedor · May 4, 2017, 3:35pm

Software evolves. Elasticsearch 1.4 did not have any of the code that is causing this issue here, yet this code was added for a reason. It had unforeseen consequences in situations like yours but that is the nature of software.

It will be fixed no later than 6.0.0, it might be fixed sooner but I can not promise that. I think the same issue exists in the 2.x series (where this code was first added). It's only recent that the prevalence of this issue has come to our attention.

Abid_Hussain · May 5, 2017, 6:49am

OK, thanks for clarification.

jasontedor · May 5, 2017, 10:21am

You're welcome.

system · June 2, 2017, 10:32am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Instant crash on startup Elasticsearch	15	5107	July 6, 2017
HELP! My production is dead Elasticsearch	9	450	July 6, 2017
Elasticsearch keeps going into red status, and thinking I have more nodes than I do Elasticsearch	8	1213	July 6, 2017
Intermittent 505's Elasticsearch	10	708	July 6, 2017
Lack of memory? Elasticsearch	11	801	July 6, 2017

Elasticsearch takes minutes to start

Related topics