Hi there, I hope this topic doesn't sound too general, but I'm opening it because I think it could be useful for a broader audience to discuss.
I'm currently working on a big data project that involves analysing large amounts of data coming from several sources, the main one being CSV files.
At the moment, the CSVs are pre-processed using Python and then ingested into an Elasticsearch cluster using Logstash. Since there are several data sources and types, Logstash also performs data enrichment by querying Elasticsearch (https://www.elastic.co/guide/en/logstash/current/plugins-filters-elasticsearch.html). Once the data is in Elasticsearch, Kibana is used to build nice visualizations.
BTW: Logstash, Elasticsearch and Kibana are running in Docker containers.
We are quite happy with this solution; however, there are some drawbacks:
- It is difficult and time-consuming to write the Python code that pre-processes the data (e.g. we manipulate large CSVs to get cleaner data; see the sketch after this list);
- The pre-processing step cannot be done through a graphical front-end, which would open it up to a much wider audience of people who don't know Python;
- Logstash filters have to be adapted to each source file being ingested: this may sound like a minor point, but it means the Logstash pipelines have to be changed whenever a source is even slightly modified.
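To make the first point more concrete, here is a minimal sketch of the kind of pre-processing we do in Python with pandas. The file name, column names and cleaning rules are made-up placeholders, not our actual schema:

```python
import pandas as pd

# Read a large CSV in chunks to keep memory usage under control
# ("measurements.csv" and the column names are placeholders).
chunks = pd.read_csv("measurements.csv", chunksize=100_000, parse_dates=["timestamp"])

cleaned = []
for chunk in chunks:
    # Drop rows with missing keys and normalise a text column
    chunk = chunk.dropna(subset=["device_id"])
    chunk["device_id"] = chunk["device_id"].str.strip().str.upper()
    cleaned.append(chunk)

df = pd.concat(cleaned, ignore_index=True)
df.to_csv("measurements_clean.csv", index=False)
```

Each new source needs its own variant of this script, which is where most of the effort goes.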
The "best" solution would be:
- To have a web GUI where one could upload or access raw data files and manipulate them (for example, joins between tables, group-by operations, etc.);
- To create basic visualizations from this web GUI (e.g. histograms);
- To send the "clean" data easily to Elasticsearch for indexing, so that Kibana can create the final visualizations (see the sketch after this list).
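For reference, the first and last points of that wish list are roughly what we do today in plain Python with pandas and the official elasticsearch client. This is a sketch only: the host, index name, files and columns are assumptions for illustration:

```python
import pandas as pd
from elasticsearch import Elasticsearch, helpers

# Hypothetical example: join two cleaned CSVs and aggregate before indexing
events = pd.read_csv("measurements_clean.csv", parse_dates=["timestamp"])
devices = pd.read_csv("devices.csv")

joined = events.merge(devices, on="device_id", how="left")
joined["day"] = joined["timestamp"].dt.strftime("%Y-%m-%d")
daily = (
    joined.groupby(["device_id", "day"])
    .agg(value_mean=("value", "mean"), samples=("value", "size"))
    .reset_index()
)

# Bulk-index the aggregated rows into Elasticsearch
es = Elasticsearch("http://localhost:9200")
actions = (
    {"_index": "daily-measurements", "_source": row}
    for row in daily.to_dict(orient="records")
)
helpers.bulk(es, actions)
```

The goal would be to expose exactly this kind of join/group-by/index workflow through a GUI instead of code.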
One solution we have considered is to build a Hadoop cluster and use Hue (http://gethue.com/) as a front-end to manipulate the data, given that Elasticsearch provides integration with Hadoop (https://www.elastic.co/products/hadoop); a sketch of that write path is included after the list below. However, this approach has some disadvantages:
- Hadoop is not recommended for data lakes of small/medium size;
- It is time-consuming and complex to set up the whole Hadoop cluster with its ancillary systems (Hive, Spark, etc.), especially because we want to run everything in Docker containers.
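For completeness, writing from Spark to Elasticsearch via the es-hadoop connector would look roughly like this. It is a hedged sketch: node address, index name and input path are assumptions, and the elasticsearch-hadoop jar has to be made available to Spark:

```python
from pyspark.sql import SparkSession

# Assumes the elasticsearch-hadoop jar is on the Spark classpath,
# e.g. via --jars or spark.jars.packages.
spark = SparkSession.builder.appName("csv-to-es").getOrCreate()

df = spark.read.csv("measurements_clean.csv", header=True, inferSchema=True)

(
    df.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "elasticsearch")          # hostname of the ES container
    .option("es.port", "9200")
    .option("es.resource", "daily-measurements")  # target index
    .mode("append")
    .save()
)
```

So the integration itself looks manageable; the concern is the operational cost of the cluster around it.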
We also considered using a relational (MySQL) or non-relational (MongoDB?) database to store all the raw data, and then finding a nice front-end to manipulate it. However, the landscape is quite vast and we are not sure which product to choose, especially with respect to Elasticsearch integration (e.g. MongoDB does not seem easy to integrate with Elasticsearch).
So I'm opening this topic to collect suggestions and comments from you! Do you have experience with solutions like these? What would you do?
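If we went the relational route, the integration worry may be smaller than it looks: pulling from MySQL with pandas/SQLAlchemy and reusing the same bulk-indexing approach as above is only a few lines. Again a sketch, with a made-up connection string, table and index name:

```python
import pandas as pd
from sqlalchemy import create_engine
from elasticsearch import Elasticsearch, helpers

# Hypothetical MySQL connection and table holding the cleaned data
engine = create_engine("mysql+pymysql://user:password@mysql:3306/rawdata")
df = pd.read_sql("SELECT * FROM measurements_clean", engine)

es = Elasticsearch("http://localhost:9200")
helpers.bulk(
    es,
    ({"_index": "measurements", "_source": row} for row in df.to_dict(orient="records")),
)
```

The open question is really the front-end for manipulating the data, not the hand-off to Elasticsearch.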
Thanks a lot!