The performance of Logstash depends a lot on the configuration, so I disagree with your statement. I suspect Logstash is your best bet for getting data replicated in near real time.
Logstash is a good option; I have been using Logstash with the jdbc input for about 8 months with no issues. I recommend tuning your SQL query and having an index on the SQL side, and, as @Badger mentioned, using a timestamp to track delta changes since the last execution. Something that worked for me is using a limited time range in my WHERE condition, i.e. getting only rows modified in the last 10 mins:
```sql
WHERE ....
AND DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), *my-modify-date-time-sql-column*) >= DATEADD(mi, -10, GETDATE())
```
This gives me a sync that is close to real time, with the Logstash pipeline running every 5 mins.
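In case it helps, here is a minimal sketch of what that delta pipeline's input could look like, assuming a SQL Server source; the connection string, credentials, driver path, and the table/column names (`my_table`, `id`, `modified_at`) are placeholders you'd replace with your own:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://myhost:1433;databaseName=mydb"  # placeholder
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"  # assumed driver location
    schedule => "*/5 * * * *"  # run every 5 minutes, matching the sync above
    # Pull only rows modified in the last 10 minutes, converting the UTC
    # column to local time the same way as the WHERE clause above
    statement => "SELECT id, name, modified_at
                  FROM my_table
                  WHERE DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), modified_at) >= DATEADD(mi, -10, GETDATE())"
  }
}
```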
For scenario 2, I have to transform all the data (millions of records) into docs. Do you recommend any tool for this, or should I loop through all the records and create the docs myself?
@Coder_Cub use Logstash for your 'migration' load, leaving the WHERE condition of your SQL input statement open so it pulls ALL the millions of rows. I've migrated close to 4M SQL rows to Elasticsearch in a couple of hours using Logstash; the time may vary if you have a transformation process in Logstash (mutate, grok, etc.) and analyzer(s) on your Elastic index. Then use a second pipeline to catch your deltas (new inserts/updates).
As a tip, use a unique ID column from SQL to define your document ID in the Elastic index; this is the value Logstash will use to find the object in Elastic and update it if the doc already exists (thinking of update scenarios).
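A rough sketch of how that one-off migration pipeline could look, with the open statement and the SQL ID mapped to the Elasticsearch document ID; again, the connection details, `my_table`, `id`, and the index name are assumptions for illustration:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://myhost:1433;databaseName=mydb"  # placeholder
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"  # assumed driver location
    # No schedule and no time filter: run once and pull everything
    statement => "SELECT id, name, modified_at FROM my_table"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "my_index"  # assumed index name
    document_id => "%{id}"  # reuse the SQL unique ID so updates overwrite instead of duplicating
  }
}
```

Because `document_id` is the SQL primary key, the delta pipeline writing to the same index will update existing docs rather than create duplicates.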