I'm considering replicating a MySQL database into elasticsearch to enable Kibana visualizations on that data. There's another application that runs off the database and the existing data gets updated.
It appears that logstash + Jdbc input plugin is the best way for this. I found sql_last_value option that can be used to detect and incrementally update the elastic data. However, my database does not have any column to indicate the updated records.
If the plugin query is just SELECT * FROM my_table, would these incremental updates work? In other words, does the plugin or MySQL have any inbuilt feature that allows the plugin to figure out updated records?
does the plugin or MySQL have any inbuilt feature that allows the plugin to figure out updated records?
No.
What you might be able to do is fetch all rows and store them in ES with the same document id each time, i.e. so you'll be overwriting the same documents over and over again. It's clearly inefficient (perhaps prohibitively so), but if there's no way to figure out the modified rows it's the best you can do.
So, as I understand, if I use the primary key id of my table as the document_id in the elasticsearch output, it would overwrite all documents every time it runs (as per the schedule). Anyway the table size is about 32MB as per MySQL with about 30k records. In my opinion, with a 5 min schedule, this amount of data should be tiny for my single-node elastic cluster.
Of course, this solution would not scale if I also want other tables - I do have one with 3 million records.
I do have the option to modify the database. I came across this and this which suggests using an updatedAt field.
If I do setup my table as suggested, I guess my query would look like SELECT * FROM my_table WHERE updatedAt > :sql_last_value ORDER BY updatedAt. Can you confirm the following? It is not very obvious in the documentation of the plugin.
that the ORDER BY clause is compulsory
using sql_last_value in the query means use_column_value and tracking_column become mandatory
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.