Network.host issue: elasticsearch.service: Main process exited, code=exited, status=78

Hi,

I am trying to configure an Elasticsearch cluster on Ubuntu 18.04 (on Azure). The basic installation works fine. However, as soon as I set the "network.host" value in the configuration file, the Elasticsearch service no longer starts.

Error message:

systemd[1]: Started Elasticsearch.
-- Subject: Unit elasticsearch.service has finished start-up
-- Defined-By: systemd
-- Support: Enterprise open source support | Ubuntu

-- Unit elasticsearch.service has finished starting up.

-- The start-up result is RESULT.
elasticsearch[3955]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: elasticsearch.service: Failed with result 'exit-code'.

The following entries don't work:
network.host: 0.0.0.0
network.host: eth0
network.host: 10.79.10.17

  • Elasticsearch version: 7.3.1
  • JVM version: OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
  • OS version: Ubuntu 18.04 LTS
  • vm.max_map_count=524288 (it does not work with the default value or with 262144 either)

By default, the connection to the static IP is refused:
curl http://10.79.10.17:9200
curl: (7) Failed to connect to 10.79.10.17 port 9200: Connection refused
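
For comparison, the default configuration binds only to the loopback interface, so a request against localhost still answers while the static IP is refused. A quick check (just a sketch, assuming the service is running and curl and ss are available):

curl http://localhost:9200          # should return the node info JSON with the default binding
sudo ss -ltnp | grep 9200           # shows which address Elasticsearch is actually listening on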

As soon as I configure the network.host value, the service does not start.

Thank you for your help in this matter

Can you check the log file in /var/log/elasticsearch that is named after your configured cluster name? I assume that a bootstrap check has failed and that the log file contains information on how to fix this, but this is just an assumption for now.

See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/starting-elasticsearch.html#start-es-deb-systemd
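
For example, something along these lines should surface the actual error (a sketch, assuming the default Debian package paths; replace <cluster.name> with whatever you configured):

sudo journalctl -u elasticsearch.service -n 50 --no-pager   # systemd's view of the failed start
sudo less /var/log/elasticsearch/<cluster.name>.log         # Elasticsearch's own log, named after cluster.name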

My config file:

# Use a descriptive name for your cluster:

cluster.name: my-search

# ------------------------------------ Node ------------------------------------

# Use a descriptive name for the node:

node.name: ${HOSTNAME}
node.ingest: true
node.data: false
node.master: false

# Add custom attributes to the node:

#node.attr.rack: r1

# ----------------------------------- Paths ------------------------------------

# Path to directory where to store the data (separate multiple locations by comma):

path.data: /var/lib/elasticsearch

# Path to log files:

path.logs: /var/log/elasticsearch

# ----------------------------------- Memory -----------------------------------

# Lock the memory on startup:

#bootstrap.memory_lock: true

# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.

# ---------------------------------- Network -----------------------------------

# Set the bind address to a specific IP (IPv4 or IPv6):

network.host: 0.0.0.0

# Set a custom port for HTTP:

This is the configuration file from /etc/elasticsearch, not the log file.

/var/log/elasticsearch/gc.log

I can see this in the log:
Entering safepoint region: GenCollectForAllocation
GC(2) Pause Young (Allocation Failure)
GC(2) Using 8 workers of 8 for evacuation
GC(2) Desired survivor size 17891328 bytes, new threshold 6 (max threshold 6)
GC(2) Age table with threshold 6 (max threshold 6)
GC(2) - age 1: 2505304 bytes, 2505304 total
GC(2) - age 2: 4066352 bytes, 6571656 total
GC(2) - age 3: 10031824 bytes, 16603480 total
GC(2) ParNew: 298211K->23277K(314560K)
GC(2) CMS: 0K->0K(699072K)
GC(2) Metaspace: 20810K->20810K(1069056K)
GC(2) Pause Young (Allocation Failure) 291M->22M(989M) 5.505ms
GC(2) User=0.04s Sys=0.00s Real=0.01s
Leaving safepoint region
Total time for which application threads were stopped: 0.0057126 seconds, Stopping threads took: 0.0000459 seconds

As written above, there should be a file in the log directory that is named after your cluster name, i.e. my-search.log. Please show the output of that one.

You are right. I am not so familiar with elasticsearch logging.

Here is the my-search.log

Thanks again

Perhaps take a look at this post... it may explain why Elasticsearch fails to start when you change network.host.

See this snippet from the logfile

[2019-09-24T13:35:48,085][INFO ][o.e.b.BootstrapChecks    ] [eslnx02] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-09-24T13:35:48,089][ERROR][o.e.b.Bootstrap          ] [eslnx02] node validation exception
[3] bootstrap checks failed
[1]: initial heap size [2147483648] not equal to maximum heap size [4294967296]; this can cause resize pauses and prevents mlockall from locking the entire heap
[2]: memory locking requested for elasticsearch process but memory is not locked
[3]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured

See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/bootstrap-checks.html
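
For reference, a sketch of what fixing those three checks could look like (the values are only examples and need to be adapted to your host):

# [1] /etc/elasticsearch/jvm.options: initial and maximum heap must match
-Xms4g
-Xmx4g

# [2] either leave bootstrap.memory_lock commented out in elasticsearch.yml,
#     or allow locking via a systemd override (LimitMEMLOCK=infinity in
#     /etc/systemd/system/elasticsearch.service.d/override.conf)

# [3] elasticsearch.yml: configure discovery, e.g. with the master-eligible node names
cluster.initial_master_nodes: ["eslnx02"]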

I have found the root cause of the issue:

  • If you configure the network.host value you also have to configure the transport.host.

This means you have to add the following section to elasticsearch.yml and your system will work.

network.host: 0.0.0.0
http.port: 9200

transport.host: _site_
transport.tcp.port: 9300

Then restart the service and it works. :slight_smile:
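
(For completeness, the restart itself, assuming the systemd unit from above:)

sudo systemctl restart elasticsearch.service
sudo systemctl status elasticsearch.service   # should now show active (running)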

Additionally, if you would like to force Elasticsearch to listen on IPv4, you have to add the following line to /etc/elasticsearch/jvm.options:

-Djava.net.preferIPv4Stack=true

So my environment works fine now. :slight_smile:

My comment: the related documentation does not mention this. :disappointed_relieved::cry:

Thank you for your support and your suggestions.

Have a great day.

I don't think that's true, and I have tried and confirmed that there is no problem running Elasticsearch with network.host set and transport.host unset.

The issue you had was that there were bootstrap checks failing.
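
In other words, a configuration roughly like the following (only a sketch with an example discovery setting, not your exact environment) starts fine without any transport.* settings:

network.host: 0.0.0.0
http.port: 9200
# satisfies the discovery bootstrap check on a master-eligible node
cluster.initial_master_nodes: ["eslnx02"]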

Thank you for your feedback. Nevertheless, the bootstrap check was OK and my solution was the key.

This solved the same issue on my Windows host as well.

Can you help us understand how to reproduce the problem you solved by setting transport.host? Normally this setting should not be set.

Sure. :slight_smile:
If you merely configure this:

network.host: 0.0.0.0
http.port: 9200

you run into the issue shown above. If you then extend your config file with the transport.host related configuration, the system works fine.

transport.host: _site_
transport.tcp.port: 9300

Putting it all together, this is the solution:

network.host: 0.0.0.0
http.port: 9200

transport.host: _site_
transport.tcp.port: 9300

My working elasticsearch.yml configuration now:

cluster.name: my-search
node.name: ${HOSTNAME}
node.data: false
node.master: true
node.ingest: false
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
transport.host: _site_
transport.tcp.port: 9300

Response of get request on port 9200:

{
    "name": "eslnx02",
    "cluster_name": "my-search",
    "cluster_uuid": "tpIiL4e5QTicAdzspRM-GQ",
    "version": {
        "number": "7.3.2",
        "build_flavor": "default",
        "build_type": "deb",
        "build_hash": "1c1faf1",
        "build_date": "2019-09-06T14:40:30.409026Z",
        "build_snapshot": false,
        "lucene_version": "8.1.0",
        "minimum_wire_compatibility_version": "6.8.0",
        "minimum_index_compatibility_version": "6.0.0-beta1"
    },
    "tagline": "You Know, for Search"
}

When your node starts up, it logs two lines containing the string bound_addresses that look like this:

[2019-09-26T12:41:10,124][INFO ][o.e.t.TransportService   ] [node-0] publish_address {192.168.1.139:9300}, bound_addresses {192.168.1.139:9300}, {192.168.1.179:9300}

and

[2019-09-26T12:41:10,787][INFO ][o.e.h.AbstractHttpServerTransport] [node-0] publish_address {192.168.1.139:9200}, bound_addresses {192.168.1.139:9200}, {192.168.1.179:9200}

These might be logged some time apart. Can you share these lines here?

Absolutely

[2019-09-26T10:09:39,513][INFO ][o.e.t.TransportService   ] [eslnx02] publish_address {10.10.10.21:9300}, bound_addresses {10.10.10.21:9300}
[2019-09-26T10:09:39,789][INFO ][o.e.h.AbstractHttpServerTransport] [eslnx02] publish_address {10.10.10.21:9200}, bound_addresses {0.0.0.0:9200}

Thanks, I am wondering if there's a bug here. Do you see a line saying bound or publishing to a non-loopback address shortly after the first of these? I.e.:

[2019-09-26T13:04:25,325][INFO ][o.e.t.TransportService   ] [node-0] publish_address {192.168.1.139:9302}, bound_addresses {192.168.1.139:9302}, {192.168.1.179:9302}
[2019-09-26T13:04:25,337][INFO ][o.e.b.BootstrapChecks    ] [node-0] bound or publishing to a non-loopback address, enforcing bootstrap checks

Would it be possible to share the whole log output from starting up the node through to seeing the first master node changed message?

Also, could you comment out just the transport.* lines in your config, restart the node and share the whole log output from starting up the node through to the whole message saying bootstrap checks failed?
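
(i.e. with those two lines disabled, roughly like this:)

# transport.host: _site_
# transport.tcp.port: 9300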

:slight_smile:

The bootstrap check related line is there, immediately after publish_address {10.10.10.21:9300}, bound_addresses {10.10.10.21:9300}:

[2019-09-26T13:10:33,104][INFO ][o.e.b.BootstrapChecks    ] [eslnx02] bound or publishing to a non-loopback address, enforcing bootstrap checks

my-search.log from the "starting ..." entry onwards:

[2019-09-26T13:10:32,850][INFO ][o.e.n.Node               ] [eslnx02] starting ...
[2019-09-26T13:10:33,097][INFO ][o.e.t.TransportService   ] [eslnx02] publish_address {10.10.10.21:9300}, bound_addresses {10.10.10.21:9300}
[2019-09-26T13:10:33,104][INFO ][o.e.b.BootstrapChecks    ] [eslnx02] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-09-26T13:10:33,130][INFO ][o.e.c.c.Coordinator      ] [eslnx02] cluster UUID [tpIiL4e5QTicAdzspRM-GQ]
[2019-09-26T13:10:33,230][INFO ][o.e.c.s.MasterService    ] [eslnx02] elected-as-master ([1] nodes joined)[{eslnx02}{DVNKx4n-QVuQbZwTvWdXDg}{ukvBIwpRSx28Sa1Ex9kzhw}{10.10.10.21}{10.10.10.21:9300}{m}{ml.machine_memory=33714167808, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 8, version: 37, reason: master node changed {previous [], current [{eslnx02}{DVNKx4n-QVuQbZwTvWdXDg}{ukvBIwpRSx28Sa1Ex9kzhw}{10.10.10.21}{10.10.10.21:9300}{m}{ml.machine_memory=33714167808, xpack.installed=true, ml.max_open_jobs=20}]}
[2019-09-26T13:10:33,277][INFO ][o.e.c.s.ClusterApplierService] [eslnx02] master node changed {previous [], current [{eslnx02}{DVNKx4n-QVuQbZwTvWdXDg}{ukvBIwpRSx28Sa1Ex9kzhw}{10.10.10.21}{10.10.10.21:9300}{m}{ml.machine_memory=33714167808, xpack.installed=true, ml.max_open_jobs=20}]}, term: 8, version: 37, reason: Publication{term=8, version=37}
[2019-09-26T13:10:33,330][INFO ][o.e.h.AbstractHttpServerTransport] [eslnx02] publish_address {10.10.10.21:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-26T13:10:33,330][INFO ][o.e.n.Node               ] [eslnx02] started
[2019-09-26T13:10:33,486][INFO ][o.e.l.LicenseService     ] [eslnx02] license [50abffd9-f659-4aa2-913b-dd81223921aa] mode [basic] - valid
[2019-09-26T13:10:33,487][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [eslnx02] Active license is now [BASIC]; Security is disabled
[2019-09-26T13:10:33,495][INFO ][o.e.g.GatewayService     ] [eslnx02] recovered [0] indices into cluster_state

Bootstrap check failed:

[1] bootstrap checks failed
[1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured

This means that at least one of discovery.seed_hosts, discovery.seed_providers, or cluster.initial_master_nodes is required.
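
For example, either of these in elasticsearch.yml would satisfy the check (a sketch; a standalone test node could alternatively set discovery.type: single-node):

discovery.seed_hosts: ["10.10.10.21"]          # addresses of the master-eligible nodes
# or
cluster.initial_master_nodes: ["eslnx02"]      # names of the initial master-eligible nodes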

Could you please send me the documentation where I can find this? :wink:

That specific check is documented here.

However I really would like to see the complete logs from your node, both from when it successfully starts up and when it fails. The excerpts you've shared are unfortunately not enough for us to work out whether there's a bug that needs fixing here.