How to index data when Hadoop and ES are in different clusters

We want to make our data (~10TB) on HDFS interactively queryable. Due to data policy, we won't be able to install ES in the same cluster as Hadoop, so the es-hadoop connector is not an option for us. Currently I'm trying to generate the index data on the Hadoop cluster, since we have a relatively large Hadoop cluster (~3K nodes) compared to the ES cluster (up to 20 nodes). But I have no idea yet about the size of the index data, or how to make it available on the ES cluster (pull via HDFS proxy?).

Do you have any idea?


Hi Kai,
I am not qualified to reply, but I will share what I am doing just for your interest.

My Hadoop cluster is 12 big nodes. Each node has 24 cores and 36TB.
My ES cluster is 4 big nodes with the same spec.
All interconnected with 10G.

On each of the 12 Hadoop nodes I run Logstash. These point to the ES nodes, three Hadoop nodes per ES node:
hadoop1 ---> ES1
hadoop2 ---> ES1
hadoop3 ---> ES1
hadoop4 ---> ES2
hadoop5 ---> ES2

On the hadoop side I run a simple streaming mapper that basically does this:
cat | nc localhost 3333

Note the use of netcat to direct the data to the local Logstash instance on port 3333. Logstash then does everything and sends the data to Elasticsearch using the Bulk API. I have different log sources; each just needs a different port number for the netcat command. Logstash listens on the various ports: 3333, 3334, 3335, etc.
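For the record, the plumbing above can be sketched roughly like this (the pipeline file name, paths and host name are hypothetical placeholders; Logstash's tcp input and elasticsearch output are real plugins):

```shell
# Hypothetical Logstash pipeline standing in for "Logstash does everything":
# listen on port 3333 and bulk-index whatever arrives into ES.
cat > source1.conf <<'EOF'
input  { tcp { port => 3333 } }
output { elasticsearch { hosts => ["http://es1:9200"] } }
EOF
logstash -f source1.conf &

# Hypothetical streaming job: each mapper pipes its input split into the
# local Logstash via netcat, so no reducers are needed.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -input /data/logs/source1 \
  -output /tmp/source1-done \
  -mapper 'cat | nc localhost 3333'
```

One pipeline file per log source, each on its own port, mirrors the port-per-source scheme described above.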

Not very sophisticated, but it works, and it's pretty easy.

The setup from elastic_paul works. Additionally, the es-hadoop connector does not require Elasticsearch to sit next to Hadoop.
It can be anywhere the Hadoop nodes have access to it - note that the connector supports HTTPS and SOCKS proxies, so even in restricted environments one can route traffic through a certain direction.
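For illustration, a minimal sketch of pointing the connector at a remote ES cluster through a SOCKS proxy (the job jar, class and host names are hypothetical; the es.* property names come from the connector's configuration options):

```shell
# Hypothetical MapReduce job built on elasticsearch-hadoop; the -D properties
# land in the Hadoop Configuration, where the connector picks them up.
hadoop jar my-indexing-job.jar com.example.IndexToEs \
  -D es.nodes=es1.example.com \
  -D es.port=9200 \
  -D es.net.proxy.socks.host=socks-gw.example.com \
  -D es.net.proxy.socks.port=1080 \
  /data/logs/source1
```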

A nice thing about Elasticsearch (I'm biased) is that you can play fairly easily with its topology - you can install nodes on the same machine or spread them out. You can even run multiple Elasticsearch clusters on the same machines, no problem.
With Hadoop, one doesn't lose that - you can run Elasticsearch entirely on the same machines as Hadoop, on just some of them, or on none at all. And the connector doesn't mind - in fact, it supports all the cases above.