Query on Indexing using es-hadoop

(Vinoth Kannan) #1


I am looking to implement the ELK stack with an existing Hadoop cluster.
My main goal is to store the logs in HDFS, use ES just for indexing, and show the analytics in Kibana.

I have a couple of questions regarding this approach:
1.) The es-hadoop connector basically gets the data from HDFS and indexes it in ES, thereby duplicating the data. Queries to ES are not redirected to Hadoop. Am I correct?
2.) The indexes are built every time we run the MR job. How can the data be indexed in "real time" as soon as it is saved in HDFS?
PS: Our environment doesn't use Storm or Spark Streaming.

Kindly clarify.

Best Regards
Vinoth Kannan

(Christian Dahlqvist) #2

It is correct that Elasticsearch stores data on its own and does not redirect queries to Hadoop. If you want to index into Elasticsearch using MapReduce, there is going to be a delay. A common approach when near real-time access to the logs is required is to feed the logs into Hadoop and Elasticsearch in parallel instead of relying on the logs first being loaded into Hadoop.

(Vinoth Kannan) #3

Thanks for the reply, Christian.

But if I want to use HDFS/Hive as the primary storage and index only a subset of the actual data in ES, how can I realize such an architecture?

I know it's bad in terms of performance, but can ES query/aggregate data from Hive and show the result in Kibana?


Vinoth Kannan

(Christian Dahlqvist) #4

In order for you to be able to search through Elasticsearch, the data must be stored in the Elasticsearch indices. The ES-Hadoop connector allows transfer of data between Hadoop and Elasticsearch, but Elasticsearch does not directly access Hadoop. If you require near real-time access to a subset of your data, the best way is likely to feed it to Elasticsearch at the same time as it is fed to Hadoop. Hadoop will still hold all the data and be your primary data store.
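
A minimal sketch of that parallel feed, assuming the logs arrive as JSON lines. The names `fan_out` and `is_error`, the `level` field, and the `logs` index are hypothetical and only for illustration. Every line is kept for the Hadoop side, while only the selected subset is turned into a request body in the Elasticsearch bulk format (one action line followed by one source line per document).

```python
import json
from typing import Callable, Iterable, List, Tuple


def fan_out(
    lines: Iterable[str],
    index: str,
    keep: Callable[[dict], bool],
) -> Tuple[List[str], str]:
    """Split one log stream into an archive copy and an Elasticsearch bulk body.

    Returns (archive_lines, bulk_body). Every line goes to the archive
    (destined for HDFS); only events accepted by `keep` go into the bulk body.
    """
    archive_lines: List[str] = []
    bulk_parts: List[str] = []
    for line in lines:
        archive_lines.append(line)  # Hadoop remains the primary store
        event = json.loads(line)
        if keep(event):
            # Bulk format: an action line, then the document source line.
            bulk_parts.append(json.dumps({"index": {"_index": index}}))
            bulk_parts.append(json.dumps(event))
    # A bulk body has to end with a newline.
    bulk_body = "\n".join(bulk_parts) + "\n" if bulk_parts else ""
    return archive_lines, bulk_body


def is_error(event: dict) -> bool:
    """Hypothetical subset rule: only ERROR events are indexed."""
    return event.get("level") == "ERROR"
```

The returned `bulk_body` would be sent to the cluster's `_bulk` endpoint with `Content-Type: application/x-ndjson`, and `archive_lines` would go through whatever path already writes the logs to HDFS. Both of those steps are left out of the sketch.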

If this is not possible and you need to write the data to Hadoop first, it is likely there will be a delay. How long this is depends on how you do the indexing, and using MapReduce jobs can, as you initially pointed out, be slow.
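
One possible way to keep that delay small, offered as a sketch rather than a recommendation: run short indexing jobs frequently and give each run only the files that arrived since the previous one. The helper below is hypothetical. It assumes you already have a listing of HDFS paths with their modification times (how you obtain it is out of scope here) and that you persist the returned checkpoint between runs.

```python
from typing import Dict, List, Tuple


def select_new_files(
    files: Dict[str, int], checkpoint: int
) -> Tuple[List[str], int]:
    """Pick files modified after `checkpoint` and return the advanced checkpoint.

    `files` maps an HDFS path to its modification time (epoch milliseconds).
    """
    new_paths = sorted(p for p, mtime in files.items() if mtime > checkpoint)
    # The checkpoint never moves backwards, even when nothing new arrived.
    new_checkpoint = max([checkpoint] + [files[p] for p in new_paths])
    return new_paths, new_checkpoint
```

The selected paths would become the input of the next indexing job. Note that this simple rule does not handle files that are still being written, or files that share a modification time with the checkpoint.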

(Vinoth Kannan) #5

@Christian_Dahlqvist Thanks for clearing that up.

My Hadoop cluster is kerberized and sits in a secured firewall zone. The ES cluster is in a different zone. If I want to use es-hadoop, how do I configure it to use a particular port?

Can es-hadoop transfer data from a kerberized Hadoop cluster to an ES cluster?

Vinoth Kannan

(Costin Leau) #6

Both questions are covered in the reference documentation.
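
As a rough starting point for the port question, here is a sketch of the connector settings involved. `es.nodes`, `es.port`, `es.nodes.wan.only` and `es.resource` are es-hadoop configuration properties; the host name, the `logs` resource, and the helper functions are hypothetical. Rendering them as `-D` arguments assumes the job is launched through Hadoop's generic options parsing. The Kerberos side is not covered here; please check the reference documentation for the version you run.

```python
from typing import Dict, List


def es_hadoop_settings(host: str, port: int, resource: str) -> Dict[str, str]:
    """Build a small set of es-hadoop properties for a firewalled setup."""
    return {
        "es.nodes": host,             # the only address opened in the firewall
        "es.port": str(port),         # HTTP port of that address
        "es.nodes.wan.only": "true",  # use only es.nodes, skip node discovery
        "es.resource": resource,      # target index (hypothetical name)
    }


def as_generic_options(settings: Dict[str, str]) -> List[str]:
    """Render the properties as Hadoop `-D key=value` arguments."""
    args: List[str] = []
    for key in sorted(settings):
        args += ["-D", f"{key}={settings[key]}"]
    return args
```

For example, `as_generic_options(es_hadoop_settings("es-gateway.example.com", 9200, "logs"))` yields the argument list to append to the job command line. With `es.nodes.wan.only` enabled, the connector talks only to the declared node, so a single host and port need to be reachable from the Hadoop zone.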
