I am looking at an ingestion pipeline using Logstash into Elasticsearch.
The source data comes from a SQL database (MySQL).
I have used the JDBC input plugin for Logstash, which is fantastic.
I have a question:
I want to schedule Logstash to index the data on a recurring basis (i.e. update the existing index). I know the JDBC plugin already has this capability.
Is this a reindex, or a new index of the data?
My scheduler probably only has to run once a month.
I have used aliases before and found them useful for atomically switching from one index to the next. I understand Logstash does not support that; I would have to use Curator.
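For context, the atomic switch I mean is the _aliases API, something like this (the index and alias names here are just placeholders):

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "articles_v1", "alias": "articles" } },
    { "add":    { "index": "articles_v2", "alias": "articles" } }
  ]
}
```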
So...
Do I simply go with the scheduler, or look at using an alias?
The scheduler will pull any new data it finds, based on sql_last_value.
You can configure things to generate a custom document ID, though, so that changed rows update the existing documents in ES instead of creating duplicates.
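Here's a minimal sketch of what that pipeline can look like. The connection details, table, and column names are placeholders; it assumes the table has an auto-incrementing `id` primary key and an `updated_at` timestamp:

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # Cron syntax: run at midnight on the 1st of every month
    schedule => "0 0 1 * *"
    # Only pull rows that changed since the last run
    statement => "SELECT * FROM articles WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "articles"
    # Reuse the MySQL primary key so a re-ingested row overwrites
    # the existing document instead of creating a duplicate
    document_id => "%{id}"
  }
}
```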
I found some interesting topics by searching for sql_last_value.
The post below demonstrates using sql_last_value and how to actually update documents.
As you said, it might be best to use the MySQL primary key as the document ID to handle any updates.
The only issue I have with that is that I remember reading that if you let Elasticsearch auto-generate document IDs, bulk indexing is much faster. Is that right?
The only action left then is delete. If a record has been removed from MySQL, will Logstash handle that as well?
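(From what I've read, the JDBC input can't see hard deletes, since a deleted row simply stops appearing in the query results. The usual workaround seems to be a soft-delete flag on the table and routing the event to a delete action, roughly like the sketch below, assuming a tinyint `deleted` column:)

```
filter {
  # Route soft-deleted rows to a delete action, everything else to index
  if [deleted] == 1 {
    mutate { add_field => { "[@metadata][action]" => "delete" } }
  } else {
    mutate { add_field => { "[@metadata][action]" => "index" } }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "articles"
    document_id => "%{id}"
    # The action can be set per-event via sprintf
    action => "%{[@metadata][action]}"
  }
}
```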