I am using the JDBC input plugin to import data from Amazon Redshift into Elasticsearch using Logstash.
I am processing incremental updates for a very big table that gains around 2 million rows every hour, with a timestamp attached to each row.
The problem I am facing is that since data from Redshift does not come back in sorted order, in order to process a batch update using :sql_last_value I have to filter the latest ~2 million rows and then sort them, which is taking a lot of time.
Is there any workaround for this problem so that sql_last_value stores the maximum of the currently processed batch rather than the last value, which requires the input to be sorted on the column assigned to sql_last_value?
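For reference, here is a minimal sketch of the kind of pipeline I am describing (the connection details, table name, and timestamp column are placeholders, not my real values):

```
input {
  jdbc {
    # Placeholder connection details for a Redshift cluster
    jdbc_connection_string => "jdbc:redshift://my-cluster.example.com:5439/mydb"
    jdbc_user => "myuser"
    jdbc_password => "mypassword"
    jdbc_driver_library => "/path/to/RedshiftJDBC42.jar"
    jdbc_driver_class => "com.amazon.redshift.jdbc42.Driver"
    schedule => "0 * * * *"          # run once an hour
    use_column_value => true         # track a column value, not the run time
    tracking_column => "updated_at"  # placeholder timestamp column
    tracking_column_type => "timestamp"
    # :sql_last_value is substituted with the stored tracking value;
    # the ORDER BY over ~2M unsorted rows is the expensive step
    statement => "SELECT * FROM big_table WHERE updated_at > :sql_last_value ORDER BY updated_at ASC"
  }
}
```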
The problem I am facing is that since data from Redshift does not come back in sorted order, in order to process a batch update using :sql_last_value I have to filter the latest ~2 million rows and then sort them, which is taking a lot of time.
Can't you let the jdbc input run more often than once an hour so that each batch becomes smaller?
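For example (a hypothetical cron-style schedule; pick whatever interval suits your ingest rate), polling every five minutes would cut each batch to roughly a twelfth of the hourly volume:

```
# Hypothetical: poll every 5 minutes instead of hourly
schedule => "*/5 * * * *"
```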
Is there any workaround for this problem so that sql_last_value stores the maximum of the currently processed batch rather than the last value
Sorry, I don't understand the difference.
which requires the input to be sorted on the column assigned to sql_last_value?
If you're only using a timestamp from a column to keep track of what has been processed, I don't see how you can possibly avoid sorting the rows before processing them.
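Concretely, I would order the statement by the tracking column, something like this (table and column names are assumptions); as I understand the plugin, sql_last_value is taken from the last row it reads, so the ORDER BY guarantees that value is also the maximum:

```
# Assumed table and column names; ordering ascending means the final row
# of the result set carries the highest timestamp, which is what gets
# persisted as :sql_last_value for the next run
statement => "SELECT * FROM big_table WHERE updated_at > :sql_last_value ORDER BY updated_at ASC"
```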
For the second part: since the rows returned are not in sorted order, what value does :sql_last_value store for the timestamp column assigned to it? Will it be the timestamp of the last processed row (which might not be the latest timestamp, because of Redshift's unsorted output), or will it store the maximum of the timestamps processed in the current batch?