I have created a Logstash pipeline that fetches data from a MySQL database and loads it into Elasticsearch (ES).
It basically consists of a SELECT over a couple of tables whose results are output to ES. I have noticed that lately this is taking longer and longer, so I guess it's fetching all rows and trying to insert both new and old entries into ES.
If that's true, I need a mechanism to avoid fetching entries that are already in ES. What's the most common pattern for this? I have been thinking about two alternatives:
A new column in the database table indicating whether the row has already been loaded into ES
Keeping a datetime in ES or in my database, fetching that value, and using it in the input query
I'm not sure about either of those because they make assumptions about, or modify, the stored data. What do you think? Do you have any other alternative?
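For reference, this is roughly what I imagine the second alternative would look like using the JDBC input plugin's built-in `:sql_last_value` tracking (a sketch only; the table, column, host, and index names are placeholders, and it assumes the table has an `updated_at` timestamp column):

```
input {
  jdbc {
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "password"
    schedule => "* * * * *"
    # Track the last seen value of updated_at; Logstash persists it
    # between runs (in last_run_metadata_path) as :sql_last_value.
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    statement => "SELECT * FROM my_table WHERE updated_at > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index"
    # Using the primary key as document_id makes re-fetched rows
    # update the existing document instead of creating duplicates.
    document_id => "%{id}"
  }
}
```

But I'm not sure whether relying on a timestamp column like this is the accepted pattern or whether it has pitfalls I'm missing.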