Elasticsearch takes minutes to start

Hi all,

I downloaded Elasticsearch 5.3.2 and changed some settings. There is no data indexed, yet it still takes about 5 minutes until the first log message appears.

Here are the settings I have changed in my config file:
cluster.name: my-application
node.name: node-1
http.port: 29200
transport.tcp.port: 29300-29400

OS is SUSE Linux Enterprise Server 11.

Any hints what's going on?

I have no idea. I'd hit the process with jstat -gcutil <pid> 3s 100 to see if it is GCing (which would be weird, but it's easy to check). If the FGC column goes up a bunch, that tells us it is spending time in full GCs. Otherwise I'd hit it with jstack <pid> | tee elasticsearch_stack, have a look at the output, and see if you have something weird going on.

Running jstack shows two threads in the WAITING state (all the others are RUNNABLE):

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007ffa68194800 nid=0x197 in Object.wait() [0x00007ffa61ac8000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000080008e98> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
- locked <0x0000000080008e98> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007ffa68190000 nid=0x196 in Object.wait() [0x00007ffa61bc9000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000080006b40> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0x0000000080006b40> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

Running jstat shows the following (I only pasted the output starting a few seconds before the first log message appears):

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0,00   0,00  20,00   0,00  17,28  19,76      0    0,000     0    0,000    0,000
  0,00   0,00  20,00   0,00  17,28  19,76      0    0,000     0    0,000    0,000
  0,00   0,00  26,01   0,00  17,28  19,76      0    0,000     0    0,000    0,000
  0,00   0,00  98,28   0,00  17,28  19,76      0    0,000     1    0,050    0,050
  0,00  59,15  43,79   0,00  94,21  87,26      1    0,046     1    0,050    0,096
  0,00  59,15  58,47   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  79,86   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  81,20   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  91,41   0,00  94,21  87,26      1    0,046     2    0,137    0,183
  0,00  59,15  98,60   0,00  94,21  87,26      1    0,046     2    0,137    0,183
100,00   0,00  13,30   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  21,83   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  22,12   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  24,64   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  32,91   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  33,13   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  34,01   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  38,46   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  43,05   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  46,37   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  48,38   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  56,33   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  56,48   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  58,67   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  66,76   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  66,98   6,97  93,00  86,25      2    0,290     2    0,137    0,427
100,00   0,00  68,60   6,97  93,00  86,25      2    0,290     2    0,137    0,427

I'm not that familiar with jstat, but to me the output doesn't seem to point to any unusual behaviour...?

jstack looks good. I think the running threads are going to be important though. Can you make a gist of that?

See https://gist.github.com/abid76/32c453f351967fae6744e2a284a6a35e

Regards,

Abid

I wonder if this is a slow NFS mount like in

If so, you can work around it by dropping the NFS mounts from the machine.

Elasticsearch is not installed on an NFS mount.

I wonder if it has something to do with the node discovery on startup.

As we use ES as a single-node installation, how can I disable node discovery?

In ES 1.4 (which we are currently using in our production environment) this is done by setting

discovery.zen.ping.multicast.enabled: false

Any suggestions on how to configure a single-node setup? Maybe this would help...

Additionally, here is the log after startup.

[2017-05-03T15:35:31,624][INFO ][o.e.n.Node               ] [node-1] initializing ...
[2017-05-03T15:35:31,807][INFO ][o.e.e.NodeEnvironment    ] [node-1] using [1] data paths, mounts [[/ (/dev/mapper/system-root)]], net usable_space [1tb], net total_space [1.9tb], spins? [possibly], types [ext3]
[2017-05-03T15:35:31,808][INFO ][o.e.e.NodeEnvironment    ] [node-1] heap size [1.9gb], compressed ordinary object pointers [true]
[2017-05-03T15:35:31,842][INFO ][o.e.n.Node               ] [node-1] node name [node-1], node ID [VNBeHdQMTyeQ-af4JTdlTg]
[2017-05-03T15:35:31,843][INFO ][o.e.n.Node               ] [node-1] version[5.3.2], pid[9978], build[3068195/2017-04-24T16:15:59.481Z], OS[Linux/3.0.101-80-xen/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_102/25.102-b14]
[2017-05-03T15:35:33,531][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [aggs-matrix-stats]
[2017-05-03T15:35:33,531][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [ingest-common]
[2017-05-03T15:35:33,531][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-expression]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-groovy]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-mustache]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [lang-painless]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [percolator]
[2017-05-03T15:35:33,532][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [reindex]
[2017-05-03T15:35:33,534][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [transport-netty3]
[2017-05-03T15:35:33,534][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [transport-netty4]
[2017-05-03T15:35:33,534][INFO ][o.e.p.PluginsService     ] [node-1] loaded plugin [x-pack]
[2017-05-03T15:35:36,079][DEBUG][o.e.a.ActionModule       ] Using REST wrapper from plugin org.elasticsearch.xpack.XPackPlugin
[2017-05-03T15:35:37,979][INFO ][o.e.n.Node               ] [node-1] initialized
[2017-05-03T15:35:37,979][INFO ][o.e.n.Node               ] [node-1] starting ...
[2017-05-03T15:35:38,328][INFO ][o.e.t.TransportService   ] [node-1] publish_address {127.0.0.1:29300}, bound_addresses {[::1]:29300}, {127.0.0.1:29300}, {127.0.0.2:29300}
[2017-05-03T15:35:38,348][WARN ][o.e.b.BootstrapChecks    ] [node-1] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2017-05-03T15:35:38,348][WARN ][o.e.b.BootstrapChecks    ] [node-1] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2017-05-03T15:35:41,450][INFO ][o.e.c.s.ClusterService   ] [node-1] new_master {node-1}{VNBeHdQMTyeQ-af4JTdlTg}{QPQmD9ddTmKnRauNaQa1ww}{127.0.0.1}{127.0.0.1:29300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-05-03T15:35:41,512][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [node-1] publish_address {127.0.0.1:29200}, bound_addresses {[::1]:29200}, {127.0.0.1:29200}, {127.0.0.2:29200}
[2017-05-03T15:35:41,519][INFO ][o.e.n.Node               ] [node-1] started
[2017-05-03T15:35:42,323][INFO ][o.e.l.LicenseService     ] [node-1] license [dc1c9071-33f0-4868-925f-3e1c366fd734] mode [basic] - valid
[2017-05-03T15:35:42,352][INFO ][o.e.g.GatewayService     ] [node-1] recovered [5] indices into cluster_state
[2017-05-03T15:35:43,091][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.monitoring-es-2-2017.05.02][0]] ...]).

The jstack shows a file store issue, exactly what the PR that @nik9000 linked to is aimed at getting around.

This can happen, but then I would expect to see DNS calls in the jstack output, which we do not see here. Also, you said that the stall happens before the first log message appears, and discovery is not loaded until well after that.

In 5.4.0 (not yet released) you will be able to set discovery.type to single-node, but I do not think that is your problem here.
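For reference, once 5.4.0 is out that should just be a single line in elasticsearch.yml (a sketch of the setting I described above):

discovery.type: single-node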

The parent folder of the ES installation is mounted as a network drive (a Windows share, not NFS). But even after moving it to a different folder that is not mounted as a network drive, it still takes very long to start.

I actually have no idea what the reason could be.

Could you please point me to the lines in the jstack output where the issue shows up? Is it this one (at line 79)?

"main" #1 prio=5 os_prio=0 tid=0x000000000060e800 nid=0x590 runnable [0x00007f51fbe08000]
   java.lang.Thread.State: RUNNABLE
  at sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)
  at sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)
  at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)
  at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:55)
  at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:70)
  at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:48)
  at sun.nio.fs.LinuxFileSystem.getFileStore(LinuxFileSystem.java:112)
  at sun.nio.fs.UnixFileSystem$FileStoreIterator.readNext(UnixFileSystem.java:213)
  at sun.nio.fs.UnixFileSystem$FileStoreIterator.hasNext(UnixFileSystem.java:224)
  - locked <0x0000000085b3f820> (a sun.nio.fs.UnixFileSystem$FileStoreIterator)
  at org.apache.lucene.util.IOUtils.getFileStore(IOUtils.java:543)
  ....

Yes, it's exactly the stack trace that you show. Today we look at all file stores, whether Elasticsearch is using them or not. The PR that was linked will change this: we will only look at the file stores that Elasticsearch is actually using. I think this is exactly your problem.
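To illustrate what I mean (a rough sketch, not the actual Elasticsearch or Lucene code; the class name and data path are placeholders): iterating the JVM's list of file stores has to stat every mount point, so one slow or unreachable mount stalls the whole loop, while asking for the file store of a specific path only touches that path.

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileStoreProbe {
    public static void main(String[] args) throws IOException {
        // Today: iterating all file stores stats every mounted file system,
        // so a single slow or unreachable mount stalls the whole loop.
        // This is the iteration your stack trace is stuck in
        // (UnixFileSystem$FileStoreIterator.readNext).
        for (FileStore store : FileSystems.getDefault().getFileStores()) {
            System.out.println(store.name() + " (" + store.type() + ")");
        }

        // After the linked PR: only look at the file store backing a path
        // that Elasticsearch actually uses (the path here is just an example).
        Path dataPath = Paths.get("/var/lib/elasticsearch");
        System.out.println("data path is on " + Files.getFileStore(dataPath).name());
    }
}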

If I understand correctly, startup takes so long because a slow Windows share is mounted somewhere in the file system (which is the case on the system where I run ES), even if it is a folder that ES will never use.

Sounds weird to me. We never had such a problem with ES 1.4. (Our aim is to migrate from ES 1.4 to the current version.)

Will this be fixed in the near future? Otherwise I'll go ahead and try ES 2.x.

Software evolves. Elasticsearch 1.4 did not have any of the code that is causing this issue here, yet this code was added for a reason. It had unforeseen consequences in situations like yours but that is the nature of software.

It will be fixed no later than 6.0.0; it might be fixed sooner, but I cannot promise that. I think the same issue exists in the 2.x series (where this code was first added). It's only recently that the prevalence of this issue has come to our attention.

OK, thanks for clarification.

You're welcome.
