How to push data from Hadoop to ES?

I've already gone through the guide provided by ES, but I'm still quite uncertain about how this works.

I'm trying to send data from Hadoop to my ES index. Is this possible?

So this is what I have tried so far:

I'm using Hive to do this. So far, I've simply created an external table from the Hive shell.

CREATE EXTERNAL TABLE eshadoop (id BIGINT, name STRING, time timestamp, url STRING) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource.write' = 'eshadooptest/eshadoop', '' = 'true', 'es.nodes.wan.only' = 'false', 'es.nodes' = 'localhost');

What I expected from the above statement was an index named eshadooptest, with the fields mentioned above, to be created in my Elasticsearch instance. However, the index isn't created, even though the table itself gets created and I can see it in my metastore. I've also uploaded a sample log file (an Apache log) to my HDFS.

What I want to know is how to push a log (the Apache log I mentioned above), or any document that is in HDFS, to an ES index, whether that index is created through Hive or in ES itself. Do I have to load the data from HDFS into Hive first and then push it to ES, or can I do it directly?

Please bear with me if I'm on the wrong track, since I'm still a noob. Thanks!

Creating a table in Hive does not necessarily allocate and initialize the index in Elasticsearch. Try inserting data into the table you've created. The index should show up after that.

An extra bit of advice: since you're writing dates to Elasticsearch, it might make sense to create the index in Elasticsearch beforehand so that your mappings are correct.
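
As a rough illustration of creating the index up front, here is a minimal Python sketch that builds the create-index request for the `eshadooptest/eshadoop` resource from the question. Several things here are assumptions, not something stated in this thread: that Elasticsearch listens on `http://localhost:9200`, that it is a version that still uses mapping types (6.x or earlier, as the `index/type` resource suggests; on 7.x and later the type level would be dropped), and that `keyword` is a suitable type for `name` and `url`. The helper names are made up for the example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: default HTTP port on the 'es.nodes' host
INDEX = "eshadooptest"            # index part of 'es.resource.write'
DOC_TYPE = "eshadoop"             # type part of 'es.resource.write'


def build_index_body(doc_type=DOC_TYPE):
    """Mapping that mirrors the Hive columns: id, name, time, url."""
    properties = {
        "id": {"type": "long"},       # Hive BIGINT
        "name": {"type": "keyword"},  # assumption: exact-match strings are enough
        "time": {"type": "date"},     # mapped explicitly so it is not guessed as a string
        "url": {"type": "keyword"},
    }
    # Typed layout (assumed ES 6.x or earlier); typeless versions omit doc_type.
    return {"mappings": {doc_type: {"properties": properties}}}


def build_create_request(es_url=ES_URL, index=INDEX):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(build_index_body()).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def create_index():
    """Send the request; requires a reachable Elasticsearch node."""
    with urllib.request.urlopen(build_create_request()) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Calling `create_index()` once before the first Hive insert should leave the index in place with `time` mapped as a date, so the connector writes into the existing mapping instead of relying on dynamic mapping.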

Thank you so much @james.baiera for the quick response :)

Yep, I'll go with that then. I'll create the index beforehand, with the appropriate mappings for the fields that I'm also going to define in the Hive table.

But my concern is this: let's say I've got this Apache log in my HDFS directory, and I want to insert only the necessary items (host, port, log-type, etc.) from the log into my Elasticsearch index.

Let's also assume that I have host, port, log-type, etc. as fields in ES and as columns in my Hive table as well. I assume it's not possible for me to push the values for those columns directly into my ES fields through Hive.

So should I have, let's say, a Java program (which could be a Spark application) that processes the Apache log from HDFS and inserts only the necessary items into the Hive table columns, so that I can then send the data to my ES index fields?

Would that be the correct way? Thanks again!

If you're already processing the log data in Spark, you can use the ES-Hadoop library to load the data into Elasticsearch at the end of your Spark job. You don't necessarily have to push it to Hive before loading to Elasticsearch.
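
A minimal PySpark sketch of that flow, under stated assumptions: the log is in Apache common log format, the ES-Hadoop jar is already on the Spark classpath, and the HDFS path passed to `run` is whatever your log location is. The field names (`host`, `method`, `status`) and the helper names are illustrative only, not something the connector requires.

```python
import re
from datetime import datetime

# Assumption: Apache common log format, e.g.
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Keep only the fields of interest; return None for lines that don't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "host": m.group("host"),
        "time": ts.isoformat(),  # ISO 8601 string, so a date mapping can parse it
        "method": m.group("method"),
        "url": m.group("url"),
        "status": int(m.group("status")),
    }


def run(hdfs_path, es_resource="eshadooptest/eshadoop", es_nodes="localhost"):
    """Read the log from HDFS, parse it, and write the rows to Elasticsearch."""
    # Imported here so the parser above can be tried without Spark installed.
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("apache-log-to-es").getOrCreate()
    rows = (
        spark.sparkContext.textFile(hdfs_path)
        .map(parse_line)
        .filter(lambda rec: rec is not None)  # drop unparseable lines
        .map(lambda rec: Row(**rec))
    )
    df = spark.createDataFrame(rows)
    # ES-Hadoop's Spark SQL data source; the jar must be on the classpath.
    (
        df.write.format("org.elasticsearch.spark.sql")
        .option("es.nodes", es_nodes)
        .mode("append")
        .save(es_resource)
    )
```

The parsing step is plain Python, so it can be checked on a few sample lines before submitting the job; only `run` needs Spark and a reachable Elasticsearch node.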

@james.baiera thank you so much. :)

Could you point me towards a starting point for processing data from HDFS and loading it into ES using the ES-Hadoop connector?

Thank you.

@Kulasangar_Gowrisang Our docs cover the features we support in the ES-Hadoop connector pretty comprehensively, but for questions about the processing libraries themselves, you're best off checking their respective documentation.
