Context:
Elasticsearch analysis results must be "day-1" data (not real time).
Data source: a SQL Server database.
Extract, transform, then load the data into an Elasticsearch index.
We performed a POC that yielded interesting results, but now we need to look at how things will behave in production.
The problem is that the source data change (inserts, deletes, updates), but we have no way to compute the delta between "day" and "day-1" in terms of data updates (there are no attributes in the database that would let us do that).
You will agree that:
Cost(computing the delta between the two extractions, "day" and "day-1") +
Cost(loading the data into ES)
could be very resource-intensive.
The remaining solution is to rebuild from scratch every night (sketched below), i.e.:
Delete the index (all types, all documents)
Extract the data from scratch (from the SQL Server database)
Rebuild the index
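To make the plan concrete, here is a minimal sketch of that nightly flow, assuming the official elasticsearch-py client; the cluster address, index name and fetch_rows() extractor are placeholders, not our actual setup:

```python
# Minimal sketch of the nightly full rebuild, assuming elasticsearch-py.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # placeholder cluster address
INDEX = "daily_index"                         # placeholder index name

def fetch_rows():
    """Placeholder for the extract/transform step against SQL Server."""
    yield {"id": 1, "field": "value"}         # illustrative document only

def rebuild():
    # 1. Delete the index (all documents)
    es.indices.delete(index=INDEX, ignore_unavailable=True)
    # 2. Recreate it (mappings/settings would go here)
    es.indices.create(index=INDEX)
    # 3. Reload everything from scratch through the Bulk API
    actions = ({"_index": INDEX, "_id": row["id"], "_source": row}
               for row in fetch_rows())
    helpers.bulk(es, actions)

rebuild()
```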
Having no experience to draw on, I would like to have your opinion on this strategy.
Is this the best solution? Thanks
We use this approach in production, but it's not right for every use case.
A few things to consider...
How long is your current ETL process? Do you expect the duration to increase? Our ETL is similar to yours: data is extracted from Oracle/SQL sources, transformed, and persisted in ES. Early versions were tested with relatively small data sets, and the results were used to estimate performance in a production environment. In reality, our first deployment was a complete failure. Moving from a few thousand records to several hundred million increased the duration from a few minutes to over 27 hours. At that point we had to basically 'start over'. Redesigning our stack and utilizing Spark for the ETL got us to where we needed to be (we are now under 15 minutes for the same production set), but you should evaluate the technologies available to you. If your job takes more than 24 hours, daily recreation is not really an option.
Ensure the external sources will be available in the future. Customers sometimes get the impression that once their data has been extracted from their current stores, those stores no longer need to be maintained. Explaining to them that they need to maintain a stack of ES nodes in addition to their existing infrastructure can cause problems.
Your feedback is very helpful.
Currently we use Talend for the extraction and transformation steps,
and the Bulk API for loading.
Obviously we tested the process with a small data set (the JSON file for our index is about 20 MB), but at production scale it will be roughly 4000 times larger per day (on the order of 80 GB of JSON daily). So yes, I hope I am not wrong in my technical choice of Talend.
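For the loading step, this is roughly what we have in mind, assuming the Talend output is newline-delimited JSON (one document per line) and the elasticsearch-py client; the file name, index name and chunk size are illustrative only:

```python
# Stream a large NDJSON export into ES in chunks instead of one huge request.
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # placeholder cluster address

def docs(path, index):
    """Yield one bulk action per line of the NDJSON export."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield {"_index": index, "_source": json.loads(line)}

# chunk_size controls how many documents go into each Bulk API request,
# which keeps memory flat even if the daily export grows to tens of GB.
for ok, item in helpers.streaming_bulk(
    es, docs("daily_export.ndjson", "daily_index"), chunk_size=5000
):
    if not ok:
        print("Failed action:", item)
```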
In your solution, do you use only Spark SQL, or the whole Spark stack?
Technically we use Spark SQL, however no DataFrames are used explicitly. At some point we will move to the new model (DF) if we can come up with some free time.
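To give you an idea, a pipeline of that general shape looks like the sketch below (written with the DataFrame API for brevity, so it is not exactly our setup; the JDBC URL, table, index and credentials are made up, and it assumes the SQL Server JDBC driver and the elasticsearch-hadoop connector are available):

```python
# Rough sketch: extract from SQL Server over JDBC, transform with Spark SQL,
# load into Elasticsearch via the elasticsearch-hadoop connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-etl").getOrCreate()

# Extract: pull the source table from SQL Server (placeholder connection values).
source = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
          .option("dbtable", "dbo.my_table")
          .option("user", "etl_user")
          .option("password", "secret")
          .load())

# Transform: express the business rules as Spark SQL.
source.createOrReplaceTempView("source")
transformed = spark.sql("""
    SELECT id, UPPER(name) AS name, amount
    FROM source
    WHERE amount IS NOT NULL
""")

# Load: write the result into the (freshly recreated) index.
(transformed.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "eshost:9200")
    .option("es.resource", "daily_index")
    .mode("append")
    .save())
```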
As far as Talend goes...I can't recommend it.
A few years ago my company released a similar product with the ETL written in Talend. We quickly learned the solution did not scale. We determined that Talend has two fundamental problems:
When jobs are constructed by GUI, developers are disconnected from the code. There is no understanding of the underlying processes/flow that occurs during execution.
Using predefined methods and functionality does not allow the customization required to produce a scalable, enterprise-grade product.
Spark has proven excellent thus far. After a relatively short introduction, all our developers were able to pick it up and contribute. Developers know exactly how the process works (they are in the code!) and the customization is limitless. I highly recommend you experiment with Spark before making a final decision.