I'm involved in a project where data is aggregated from a wide variety of data sources into a SQL Server instance. The aggregation is continuous (via a polling service) and involves large volumes of data (millions of records). A set of Elasticsearch indices is populated from the SQL Server data, with each index containing a graph of data drawn from several SQL tables. Elasticsearch is then used to provide fast searching of records (speed is of the essence here).
The mechanism for populating and maintaining the Elasticsearch index is under review, and this is the reason for posting:
What options are there for maintaining the Elasticsearch index in this scenario?
Is there anything considered as best-practice?
I realise this is somewhat of an open question, but I'm looking for suggestions or opinions on what would be worth investigating. I've read some other posts on the subject, but they suggest using "rivers" which I think have now been deprecated.
It would be great if we can get some suggestions from the community...
I've just seen a webinar on using Spark to 'join' data and load it into Elasticsearch. Is this a viable option for querying data from multiple relational schemas in SQL Server, assembling it into objects and indexing those into ES?
Basically, I'd recommend modifying the application layer, if possible, so that it sends data to Elasticsearch in the same "transaction" in which it sends the data to the database.
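To illustrate roughly what I mean, here is a minimal sketch rather than a definitive implementation. It assumes a made-up `customer` / `customer_order` schema, uses an in-memory SQLite database as a stand-in for SQL Server, and uses a hypothetical `FakeIndex` class as a stand-in for whatever Elasticsearch client wrapper you use. The point is only the shape: build the joined document in the application, then write the rows and index the document inside one unit of work.

```python
import sqlite3


class FakeIndex:
    """Hypothetical in-memory stand-in for an Elasticsearch index wrapper."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, document):
        self.docs[doc_id] = document


def make_db():
    """Create an in-memory database with an illustrative two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE customer_order (customer_id INTEGER, item TEXT, qty INTEGER)"
    )
    return conn


def save_customer(conn, search_index, customer_id, name, orders):
    """Insert the relational rows and index the joined document together.

    If indexing raises, the database transaction is rolled back, so the
    database does not end up holding rows that were never indexed.
    """
    # The document is the "graph" built from what would be several tables.
    document = {
        "name": name,
        "orders": [{"item": item, "qty": qty} for item, qty in orders],
    }
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO customer (id, name) VALUES (?, ?)", (customer_id, name)
        )
        conn.executemany(
            "INSERT INTO customer_order (customer_id, item, qty) VALUES (?, ?, ?)",
            [(customer_id, item, qty) for item, qty in orders],
        )
        search_index.index(customer_id, document)
    return document
```

Note the quotes around "transaction": this is not a true distributed transaction. If the database commit itself fails after the index call has succeeded, the index will hold a document the database does not, so you would still want some form of periodic reconciliation or re-indexing as a safety net.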