I have a few questions about a use case I am presently working on, related to establishing a data pipeline (data integration) from Hadoop to Elasticsearch.
I did a few quick POCs by creating the indices from Hive, exporting the data to them, and later viewing them in Kibana. But I am looking at a much broader picture, and below are my questions:
What are some best practices you would suggest for this kind of activity? Any references would surely be helpful.
Where should we perform all the staging, incremental-load, and transformation-related activities: in Hadoop or in Kibana? As I understand it, this can be done in Hive via HQL/Spark SQL queries, which we can schedule using any scheduler; a rough sketch of what I have in mind is shown below. But is there anything on the Kibana side that would give it an upper hand over Hadoop?
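For reference, the kind of incremental export I have in mind would look roughly like this (the table, column, host, and index names are placeholders; it assumes the source table has a modified-timestamp column and that the elasticsearch-hadoop connector jar is on the Spark classpath):

```python
# Incremental-load sketch: pick up only rows changed since the last run
# and push them to the Elasticsearch index via the elasticsearch-hadoop connector.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders-incremental-load")
    .enableHiveSupport()
    .getOrCreate()
)

# Placeholder: the last successful watermark would come from a control table
# or a file in the Storage Account rather than being hard-coded.
last_watermark = "2021-06-01 00:00:00"

# Staging/transformation happens here as plain Spark SQL over the Hive table.
incremental_df = spark.sql(f"""
    SELECT *
    FROM sales_db.orders
    WHERE modified_ts > '{last_watermark}'
""")

(
    incremental_df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-host")           # Elasticsearch host (placeholder)
    .option("es.port", "9200")
    .option("es.mapping.id", "order_id")     # natural key used as the document _id
    .option("es.write.operation", "upsert")  # re-runs stay idempotent
    .mode("append")
    .save("orders")                          # target index name (placeholder)
)
```

The POC export is essentially the same job without the watermark filter, and the whole script is what I would hand to the scheduler.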
The Hive environment is hosted on an HDInsight cluster, and I guess the ECS is on-premises. As I have newly joined the team, much of this information is abstracted from me, but I will make sure to get it ASAP. Overall, do we need to do anything differently when a cloud environment comes into the picture?
Lastly, as we are moving data from one place to another, reconciliation plays an important role. Can you suggest a few ways of doing this? Would it make sense to include a Python script that gets the counts from Hive and from Elasticsearch (what Kibana shows), and produces a flat file (formatted in a beautiful way; well, I know flat files don't look that way :P) to store in the Storage Account, so that it can later be used by a SQL engine? A rough sketch of what I have in mind is shown below.
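Something along these lines is what I was picturing (the host, table, and index names are placeholders; it assumes PyHive and the official elasticsearch Python client are available, and the upload to the Storage Account is left out):

```python
# Reconciliation sketch: compare row counts between a Hive table and the
# corresponding Elasticsearch index and append the result to a flat file.
import csv
from datetime import datetime, timezone

from pyhive import hive                  # pip install "pyhive[hive]"
from elasticsearch import Elasticsearch  # pip install elasticsearch

HIVE_TABLE = "sales_db.orders"  # placeholder table name
ES_INDEX = "orders"             # placeholder index name

# Count on the Hive side.
hive_conn = hive.Connection(host="hdinsight-head-node", port=10000, username="hive")
cursor = hive_conn.cursor()
cursor.execute(f"SELECT COUNT(*) FROM {HIVE_TABLE}")
hive_count = cursor.fetchone()[0]

# Count on the Elasticsearch side (the data Kibana visualises).
es = Elasticsearch(["http://es-host:9200"])
es_count = es.count(index=ES_INDEX)["count"]

# Append a reconciliation record to a flat file; the file would then be copied
# to the Storage Account and exposed to a SQL engine as an external table.
with open("reconciliation.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([
        datetime.now(timezone.utc).isoformat(),
        HIVE_TABLE,
        ES_INDEX,
        hive_count,
        es_count,
        "MATCH" if hive_count == es_count else "MISMATCH",
    ])
```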
Answers to the above pointers would help me make a great start.