Using Nginx Proxy + ES + Spark

I'd like to set up an Nginx proxy to secure my ES cluster, but I'm stuck combining the Nginx proxy, ES, and Spark.

I bound Elasticsearch to _local_ to prevent clients from connecting directly, bypassing the proxy:

http.port: 9200
transport.tcp.port: 9300
network.host: _local_
transport.host: _non_loopback:ipv4_

And the Nginx configuration is below (nothing different from the ES manual):

server {
    listen       19200;
    server_name  localhost;

    location / {
      proxy_pass http://localhost:9200/;
      proxy_http_version 1.1;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";
    }
}

Every node has the same settings as below, and all ES plugins (head/marvel/kibana) and my Python script worked well after adding the proxy.

+------------------+       +------------------+                        +------------------+
|Node 1            |       |Node 2            |                        |Node n            |
+------------------+       +------------------+                        +------------------+
| Nginx            |       | Nginx            |          .             | Nginx            |
|(listen on 19200) |       |(listen on 19200) |          .             |(listen on 19200) |
|                  |       |                  |          .             |                  |
| Elasticsearch    |       | Elasticsearch    |                        | Elasticsearch    |
|(listens on 9200) |       |(listens on 9200) |                        |(listens on 9200) |
+------------------+       +------------------+                        +------------------+

I've changed the Spark configuration like this (without the Nginx proxy, Spark worked well):

conf.set("spark.driver.allowMultipleContexts", "true")
conf.set("", "true")
conf.set("es.nodes.discovery", "true")
conf.set("es.nodes", "es_hostname:19200")

But after setting up the proxy, I'm getting an exception.

I enabled the TRACE logging level and realized that Spark was trying to connect to localhost:9200 or public_ip:9200. That's why I'm getting a NoNodesLeftException.

After reading [ES Hadoop Configuration], I found es.nodes.wan.only, and these settings made Spark work well. But the problem is that ONLY the proxy node receives the _bulk requests, so its incoming traffic is very high.
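For reference, here is a minimal sketch of that workaround as I understand it. The settings are shown as a plain dict for illustration; in practice each entry would be passed via conf.set(key, value) on the SparkConf, and the host name is a placeholder:

```python
# Hypothetical es-hadoop settings (dict form for illustration only); in a
# real job each entry would be applied with conf.set(key, value).
es_conf = {
    "es.nodes": "es_hostname:19200",  # the Nginx proxy endpoint
    "es.nodes.wan.only": "true",      # restrict connections to the declared nodes
    "es.nodes.discovery": "false",    # skip /_nodes/transport discovery entirely
}
print(es_conf["es.nodes"])
```

With es.nodes.wan.only enabled, the connector never tries to reach the per-node published addresses, which is why all traffic funnels through the declared proxy node.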


Can I conclude that es.nodes.wan.only (pointing at a single proxy host:port) is the only way to use (proxy + ES + Spark)? Is there another way to use the Nginx proxy with Spark?

I'm using ES 2.3.2 and ES-Hadoop 2.3.2.


Hello! To answer your question, I have to touch on node discovery and traffic very quickly.

When es-hadoop bootstraps, it attempts to discover the nodes in your Elasticsearch cluster by executing a REST call to the /_nodes/transport endpoint. This returns a list of the cluster nodes and the addresses at which you can reach them. The connector will then attempt to communicate with each node using its returned http_address. Because you have set network.host to the local loopback, each node will report localhost as its only address for HTTP requests. This has to do with how the bind and publish addresses are set up for the HTTP module.
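To illustrate, here is a self-contained sketch of that discovery step with a made-up /_nodes/transport payload (node names and addresses are hypothetical); the connector essentially collects each node's http_address:

```python
import json

# Illustrative sample of what GET /_nodes/transport might return for this
# cluster. With network.host: _local_, every node publishes localhost as
# its http_address, so this is all the connector ever sees.
sample_nodes_response = json.dumps({
    "nodes": {
        "abc123": {"name": "Node 1", "http_address": "localhost/127.0.0.1:9200"},
        "def456": {"name": "Node 2", "http_address": "localhost/127.0.0.1:9200"},
    }
})

def discovered_http_addresses(nodes_json):
    """Mimic the discovery step: collect each node's published http_address."""
    nodes = json.loads(nodes_json)["nodes"]
    return [info["http_address"] for info in nodes.values()]

print(discovered_http_addresses(sample_nodes_response))
```

Every Spark executor then tries to reach "localhost:9200" on its own machine, which is why the connector eventually gives up with a no-nodes-left error.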

Generally speaking, the host settings in Elasticsearch can be intimidating, but they give you a lot of control over which interfaces your cluster uses for traffic. Elasticsearch deals with two types of network traffic: Transport and HTTP. There are two host settings for each of these traffic types: *.bind_host and *.publish_host. They normally share the same value, but when working with proxies you may need to change them.

Now, to get back to the issue at hand:

You have network.host configured to use the local loopback and transport.host configured to use a non-loopback IP address. This makes sense, because you probably don't want Nginx to get in the way of node-to-node traffic in your cluster. When a node starts up, it chooses the address in transport.host to bind to and publish for Transport traffic; if this were unset, it would default to network.host. You do not have any specific http settings, so the node defaults to network.host to bind to and publish for HTTP requests.

If you want ES-Hadoop to correctly discover the nodes, you'll need to configure http.publish_host to be the address of the Nginx proxy you want the client to go through. When the node starts up, it will still correctly ignore all incoming HTTP traffic not coming from localhost, since http.bind_host will default to the network.host value. On the other side, when broadcasting its address to clients, it will return the value set for http.publish_host (the Nginx proxy address). This way, when node discovery takes place, the nodes endpoint will return the address of the proxy for the connector to hit instead of just localhost. When following this pattern, you probably don't need the other proxy settings in your es-hadoop configuration.
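In elasticsearch.yml terms, that suggestion might look something like this (nginx_proxy_host is a placeholder for the address clients should use to reach your proxy):

```yaml
network.host: _local_                 # bind HTTP to loopback; only the local Nginx can reach it
transport.host: _non_loopback:ipv4_   # node-to-node Transport traffic stays direct
http.publish_host: nginx_proxy_host   # advertise the proxy address to HTTP clients
```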

Now, whether or not this proxy will afford any true security is a completely different discussion. In this specific case, anyone would be able to connect to the cluster via the regular old Java Transport client, since the Transport traffic is unpoliced. To be completely honest, if you're looking to implement serious access control for Elasticsearch, I would highly recommend the Shield plugin instead.

I hope this helps!

Thank you for the great help and support.

I'll try http.publish_host.


In this specific case, anyone would be able to connect to the cluster via the regular old Java Transport client since the Transport traffic is un-policed.

Does this mean that indices could be deleted via the old Java Transport client? What I want to do is prevent indices from being deleted, so I thought Nginx would be a simple and nice solution, as "Playing HTTP Tricks with Nginx" and "NGINX & Elasticsearch: Better Together" suggested.

That is true. While the Transport client will eventually be deprecated for general client use, it is still a valid and accessible Java package in most distributions. When the nodes start up, they bind their transport module to an externally available IP address. Anyone with a Transport client could connect to the cluster, obtain an admin client, and delete whichever indices they wish. This is why I recommend Shield: the access control it provides lives within Elasticsearch and is much harder to circumvent.
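For completeness, the kind of HTTP-side trick those articles describe could be sketched as an Nginx rule that rejects DELETE requests at the proxy (a sketch only; it guards the HTTP side and does nothing about Transport traffic):

```nginx
server {
    listen       19200;
    server_name  localhost;

    location / {
      # Refuse deletions at the proxy; all other methods pass through.
      if ($request_method = DELETE) {
        return 403;
      }
      proxy_pass http://localhost:9200/;
      proxy_http_version 1.1;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";
    }
}
```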

Hi, James.

This is the test result of http.publish_host you've told me.

I'm running ES (port 9200) and Nginx (port 19200) on the same machine, so I've changed the settings like this:

http.port: 9200
transport.tcp.port: 9300
network.host: _local_
transport.host: _non_loopback:ipv4_
http.publish_host: _non_loopback:ipv4_   <= this has been added

With these settings, Spark tried to connect to non_loopback:9200, but the problem is that port 9200 is only bound on localhost, so a NoNodesLeftException was thrown.

If ES and Nginx ran on different boxes, I think Spark would work well. But for that, the HTTP interface would have to be bound to a non-loopback address, which means anyone could execute REST APIs directly against ES without going through Nginx (not secure). That's why I installed ES and Nginx on the same box.

In conclusion: I wish ES had something like http.publish_port, which does not seem to be supported yet.

Thank you for your help.

That should be a property that is available... Please see this documentation:
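A sketch of how that setting would fit the layout in this thread (ES and Nginx on the same box, with the ports used above):

```yaml
network.host: _local_                  # HTTP still binds to loopback only
transport.host: _non_loopback:ipv4_    # node-to-node traffic stays direct
http.publish_host: _non_loopback:ipv4_ # advertise the external address
http.publish_port: 19200               # advertise the Nginx port instead of 9200
```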

Oops, I didn't know that; I was only reading the Network Settings page, and it was not listed there.

Thanks a lot! 감사합니다 (thank you)!