I have a few questions about a use case I am presently working on. It involves establishing a data pipeline (data integration) from Hadoop to Elasticsearch.
I did a few quick POCs by creating the indices in Hive, exporting the data to them, and later viewing them in Kibana. But I am looking at something much broader, and below are my questions:
What are some best practices you would suggest for this kind of activity? Any references would be helpful.
Where should we perform all the staging, incremental loads, and transformation-related activities: in Hadoop or in Kibana? As I understand it, this can be done in Hive via HQL/Spark SQL queries, which we can schedule with any scheduler. But is there anything in Kibana that would give it an upper hand over Hadoop?
The Hive environment is hosted in an HDInsight cluster, and I believe the ECS is on-premises. As I have newly joined the team, much of this information is still abstracted from me, but I will make sure to get it ASAP. Overall, do we need to do anything differently when a cloud environment comes into the picture?
Lastly, since we are moving data from one place to another, reconciliation plays an important role. Can you suggest a few ways of doing this? I was thinking of writing a Python script to get the counts from Hive and Kibana and produce a flat file (probably in a beautiful way; well, I know they don't look that way :P) to store in the Storage Account, which could later be used by a SQL engine.
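Roughly, what I had in mind is something like the sketch below. It is only to illustrate the idea; the library choices (PyHive and the elasticsearch Python client), hosts, and table/index names are all assumptions on my part:

```python
# Rough sketch of the count-reconciliation idea: pull the row count from Hive
# and the document count from Elasticsearch, then write a small CSV that a
# SQL engine (or anything else) can pick up later.
# Libraries, hosts, and table/index names here are placeholders.
import csv
from datetime import datetime, timezone

from pyhive import hive
from elasticsearch import Elasticsearch

hive_conn = hive.Connection(host="hive-host", port=10000, database="curated_db")
es = Elasticsearch("http://es-node-1:9200")

# Row count on the Hive side.
cursor = hive_conn.cursor()
cursor.execute("SELECT COUNT(*) FROM events")
hive_count = cursor.fetchone()[0]
cursor.close()

# Document count on the Elasticsearch side.
es_count = es.count(index="events-index")["count"]

# Write the comparison out as a flat file for later consumption.
with open("recon_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run_ts", "hive_count", "es_count", "difference"])
    writer.writerow([
        datetime.now(timezone.utc).isoformat(),
        hive_count,
        es_count,
        hive_count - es_count,
    ])
```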
If I can get answers to the above points, I'll be off to a great start.
Hi @Kishanu_Bhattacharya. Those are some big questions! I've tried to answer below, but my answers are definitely not complete. As you suspect, there is a lot to think through here.
It sounds like you're just beginning your project, so I would highly recommend using Spark over Hive if that is possible. They both run on a Hadoop YARN cluster, but Spark is usually faster and probably better supported by the community (including the elasticsearch-hadoop project). I'm not sure exactly what kind of best practices you're looking for, but definitely make sure you throttle your data load into Elasticsearch so that you don't impact your users too badly (or worse, bring down the cluster).
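For example, a throttled load from a Hive table via the elasticsearch-hadoop connector's Spark SQL support might look something like the sketch below. The hosts, table/index names, and batch sizes are placeholders you'd tune for your own cluster; the `es.*` options are the connector's bulk-sizing settings:

```python
# Minimal sketch: write a Hive table to Elasticsearch from Spark using the
# elasticsearch-hadoop connector, keeping bulk requests small so the load
# doesn't overwhelm the cluster. Hosts, names, and sizes are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-es-load")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the curated data from Hive (table name is hypothetical).
df = spark.sql("SELECT * FROM curated_db.events")

(
    df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-node-1,es-node-2")   # your Elasticsearch hosts
    .option("es.port", "9200")
    # Throttle the bulk requests sent to Elasticsearch.
    .option("es.batch.size.entries", "1000")
    .option("es.batch.size.bytes", "1mb")
    .option("es.batch.write.retry.count", "3")
    .mode("append")
    .save("events-index")                        # target index
)
```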
Since it sounds like you're in a Hadoop cluster, I would probably stage the data in HDFS (although that might not be necessary, depending on where your data is coming from). If you want to use Spark, you might look into Spark Structured Streaming. If possible, consider timing your data loads for the times of day when your application has the fewest users. If you want an Elasticsearch-native alternative to the Hadoop ecosystem, ingest pipelines are an option -- Ingest pipelines | Elasticsearch Guide [master] | Elastic. But if you have a large amount of data and already have a YARN cluster, Hive/Spark are probably the way to go.
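If you do go down the Structured Streaming road, the skeleton might look roughly like this. The paths, schema source, and index name are assumptions, and the `es` sink format comes from the elasticsearch-hadoop connector:

```python
# Rough sketch: incrementally pick up new Parquet files staged in HDFS and
# stream them into Elasticsearch. Paths, schema, and index are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-es-stream")
    .getOrCreate()
)

# File sources in Structured Streaming need an explicit schema;
# borrow it from a one-off batch read of the staged data.
schema = spark.read.parquet("hdfs:///staging/events").schema

stream = (
    spark.readStream
    .schema(schema)
    .parquet("hdfs:///staging/events")
)

query = (
    stream.writeStream
    .format("es")                                   # es-hadoop streaming sink
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .option("es.nodes", "es-node-1,es-node-2")
    .start("events-index")                          # target index
)

query.awaitTermination()
```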
Make sure your firewall is configured to allow all of your Hadoop nodes to talk to all of your Elasticsearch nodes on port 9200 (and 9300, depending on what you're doing).
By reconciliation, I assume you mean a report that shows what happened to every input document. This is not easy, and it is not supported by ingest pipelines. What I have done in the past is write out success or failure for each document at each stage of the Spark or Hadoop pipeline to sequence files in HDFS. Then, after the load into Elasticsearch has completed, run a subsequent Spark job that reads those sequence files and finds the last known state of each document. And finally, run a Spark job that loads the results into a SQL database for reporting.
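As a loose sketch of that last-known-state step (not a drop-in implementation): it assumes each stage wrote per-document status records with `doc_id`, `stage`, `status`, and `event_ts` fields, reads them as JSON rather than raw sequence files for readability, and uses placeholder JDBC details for the reporting database:

```python
# Loose sketch of the "last known state" reconciliation idea. Paths, column
# names, and JDBC connection details are all placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("es-load-recon").getOrCreate()

# Status records emitted by each stage of the load.
status = spark.read.json("hdfs:///pipeline/status/*")

# Keep only the most recent record per document: that is its last known state.
latest = (
    status
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("doc_id").orderBy(F.col("event_ts").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Push the result into a SQL database for reporting (JDBC URL and credentials
# are hypothetical; the driver must be on the Spark classpath).
(
    latest.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://reporting-db:1433;databaseName=recon")
    .option("dbtable", "dbo.load_recon")
    .option("user", "recon_user")
    .option("password", "***")
    .mode("overwrite")
    .save()
)
```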
Thanks, Keith, for the explanation and for describing the best possible ways to do this.
Yes, the use case is a brand new one, but we already have an existing Hadoop cluster. I will definitely explore the Spark options, as I feel you are absolutely right in comparing it with Hive.
Also, the recon process looks promising. I will try implementing it the same way and will keep you posted if all goes well.