Node communication / Firewall question

Hi,

I am trying to set up a cluster where one node is behind a firewall (in a
company network), and the other is on a web server at a web hosting company.
The use case is that data is generated by systems on the company intranet,
and should be made available for searching on a hosted web server - using
ES.

The company network is behind a firewall, and not reachable from the
outside. This can't be changed for security reasons. Opening ports on the
company firewall is out of the question.

The web-server node (data=true, master=false) should only receive index
updates from the node inside the company network and act as the backend to
the websites' search engines. There is no need for the web-server node to
send back index updates to the company network node.

I have tried to set up an ssh tunnel on the company node to connect through
using unicast discovery. But the web server node does not join the cluster.
It seems that the SSH tunnel from inside the firewall out isn't sufficient?

Maybe I am overcomplicating things and there is a better way to send index
data from one ES node to another? Doing it in a cluster appeared to me as
the fastest, most efficient way to keep the web's search engine up-to-date,
but if this isn't possible, I am open to other suggestions, such as
rsync'ing index data.

Thank you very much
Ben

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

If what I am trying to do is impossible without opening the firewall,
please let me know.

If there are no ports open for communication on the node inside the company network, you're pretty much out of luck.

If I'm understanding correctly, you want to use the ES instance running on your company network as a client node to push data to the web-server node. While this is a common setup and can be achieved in unicast by pointing your internal node to the web-server node (acting as master), I think all nodes in the cluster must be able to both send and receive on the transport ports (I could be wrong on this).

However, since you're already using the web-server node to store the data being searched, why not just script your index updates over HTTP to the web-server node? Especially if you're not storing the data on the ES node in the company network -- if that's the case, you should definitely go ahead and just index your data over HTTP IMO. And if you need a replica for failover, just put a node up in the web-server node's data center. You can always access that data from within the company network, I'm just not sure if it can be accessed/manipulated from a client ES instance behind a firewall.
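To make the "script your index updates over HTTP" idea concrete, here is a minimal sketch in Python that builds a request body for the bulk API (`_bulk`) and POSTs it to the web-server node. The function names, the index name `logs`, the type name `entry`, and the URL in the usage comment are all hypothetical placeholders, not anything from your setup. The `_type` field in the action line is an assumption that matches ES releases of this era, so check it against the version you run.

```python
import json
import urllib.request


def build_bulk_body(docs, index, doc_type):
    """Render documents as a newline-delimited _bulk request body."""
    lines = []
    for doc in docs:
        # Action line: tells ES to index the document on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


def send_bulk(base_url, body):
    """POST a bulk body to the node at base_url and return the parsed reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (hypothetical host, not executed here):
#   body = build_bulk_body([{"msg": "a"}, {"msg": "b"}], "logs", "entry")
#   send_bulk("http://webserver.example.com:9200", body)
```

Since the connection is opened from inside the company network outwards, this only needs outbound access to the web-server node's HTTP port, which is the point of the suggestion above.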

On Sunday, June 23, 2013 12:51:48 PM UTC+9, Ben Hundley wrote:

If there are no ports open for communication on the node inside the company
network, you're pretty much out of luck.

If I'm understanding correctly, you want to use the ES instance running on
your company network as a client node to push data to the web-server node.
While this is a common setup and can be achieved in unicast by pointing your
internal node to the web-server node (acting as master), I think all nodes
in the cluster must be able to both send and receive on the transport ports
(I could be wrong on this).

Correct, that is what I am trying to achieve. I wouldn't rule out that I am
missing a crucial configuration setting or have a wrong idea about the
communication needs between the nodes.

I was under the impression that an SSH tunnel from inside the firewall out
would suffice, but when I set this up with unicast, [CompanyNode] behind a
firewall and [WebserverNode] not able to contact back, this is in the logs:

[2013-06-23 14:50:34,219][INFO ][discovery.zen ] [CompanyNode]
failed to send join request to master
[[WebserverNode][c2jJ_cydQwezy3CtNpzyuw][inet[/127.0.0.1:9300]]], reason
[org.elasticsearch.transport.RemoteTransportException:
[CompanyNode][inet[/127.0.0.1:9300]][discovery/zen/join];
org.elasticsearch.ElasticSearchIllegalStateException: Node
[[CompanyNode][QHVR0fPxSvqiLohS0A-zmg][inet[/192.168.4.2:9300]]{client=true,
data=false}] not master for join request from
[[CompanyNode][QHVR0fPxSvqiLohS0A-zmg][inet[/192.168.4.2:9300]]{client=true,
data=false}]]

I can telnet into the WebserverNode through the SSH tunnel fine.

However, since you're already using the web-server node to store the data
being searched, why not just script your index updates over HTTP to the
web-server node? Especially if you're not storing the data on the ES node
in the company network -- if that's the case, you should definitely go
ahead and just index your data over HTTP IMO. And if you need a replica for
failover, just put a node up in the web-server node's data center. You can
always access that data from within the company network, I'm just not sure
if it can be accessed/manipulated from a client ES instance behind a
firewall.

This would be an option, but I worry about overhead when using HTTP
requests. The nature of this data is like log files, with a couple hundred
thousand documents per day needing to be indexed.
Apart from that, I need the data on the CompanyNode as well for internal
applications.

I will talk to our admins if it is okay to set up a VPN from the web server
back into the company network, so two-way communication could work.

Anyway, thank you for your help.

I was under the impression that an SSH tunnel from inside the firewall out would suffice, but when I set this up with unicast, [CompanyNode] behind a firewall and [WebserverNode] not able to contact back, this is in the logs:

[2013-06-23 14:50:34,219][INFO ][discovery.zen ] [CompanyNode] failed to send join request to master [[WebserverNode][c2jJ_cydQwezy3CtNpzyuw][inet[/127.0.0.1:9300]]], reason [org.elasticsearch.transport.RemoteTransportException: [CompanyNode][inet[/127.0.0.1:9300]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[CompanyNode][QHVR0fPxSvqiLohS0A-zmg][inet[/192.168.4.2:9300]]{client=true, data=false}] not master for join request from [[CompanyNode][QHVR0fPxSvqiLohS0A-zmg][inet[/192.168.4.2:9300]]{client=true, data=false}]]

Not sure if this is the cause of the error, but for this to work, the webserver node must have master set to true. I noticed you had set it to false in the original post. The company node, which is not master, should have the IP of the web node in its unicast settings so it knows which node to contact.

This would be an option, but I worry about overhead when using HTTP requests. The nature of this data is like log files, with a couple hundred thousand documents per day needing to be indexed. Apart from that, I need the data on the CompanyNode as well for internal applications.

I understand the hesitation, though it probably wasn't best for me to phrase it as "send the data over HTTP". Regardless of how you connect to the client node (Java API, HTTP, etc), the data must be pushed over TCP to the data nodes. My point is that the overhead of HTTP is relatively trivial, and you already have to get the data out of your company's network and to the web node. The only difference is that the client-data cluster setup hands the data transport job to ES.
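To put rough numbers on the overhead, here is a small back-of-the-envelope sketch in Python. It assumes "a couple hundred thousand documents per day" means about 200,000, and that documents are sent in batches of 1,000 per request. Both figures and the function name are my assumptions for illustration, not measurements.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def indexing_load(docs_per_day, batch_size):
    """Return (average docs per second, batched requests per day)."""
    docs_per_second = docs_per_day / SECONDS_PER_DAY
    # Round up: a final partial batch still costs one request.
    requests_per_day = math.ceil(docs_per_day / batch_size)
    return docs_per_second, requests_per_day


rate, requests = indexing_load(200_000, 1_000)
print(f"{rate:.1f} docs/s on average, {requests} batched requests per day")
# prints "2.3 docs/s on average, 200 batched requests per day"
```

Under those assumptions the average load is only a few documents per second, so the per-request cost of HTTP is spread over many documents. Real traffic will be burstier than the average, so treat this as an estimate only.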