Periodically Query And Process Only New Records

Hello everybody,

Background:
In our system, events arrive as a stream. In most cases we just need to index (add) the events into new or existing documents (upsert); however, there is one kind of event for which we need to update existing documents. For example: we get constant change-sets of "new order arrived", but after a while we might get an event such as "order #3123 has been cancelled" - which means we need to find the old order and mark it as cancelled. When using the bulk API, Elasticsearch can't guarantee that all the index operations will arrive BEFORE the update operations. The simplest and safest solution, as far as we understand, is to delay the update operations and stream them to our system after something like 24 hours. We plan to stream these events to a temporary rolled-over index in Elasticsearch instead of streaming them directly to our system immediately.
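
To make the plan above concrete, here is a minimal Python sketch of the routing step: ordinary events go straight to indexing, while cancellation events are set aside for the delayed path. The event shape, the `type` field, and the `order_cancelled` value are hypothetical names chosen for illustration; they are not part of our real schema.

```python
from typing import Dict, Iterable, List, Tuple


def route_events(events: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split a batch into events to upsert now and updates to delay."""
    immediate: List[Dict] = []
    delayed: List[Dict] = []
    for event in events:
        # "type" and "order_cancelled" are assumed names for this sketch.
        if event.get("type") == "order_cancelled":
            delayed.append(event)    # would go to the temporary index
        else:
            immediate.append(event)  # would be indexed/upserted right away
    return immediate, delayed
```

The delayed list is what we would write to the temporary index and pick up again roughly 24 hours later.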

Our Questions:

  • How can we query Elasticsearch for events created 24 hours ago, without repeating the same records over and over again?
  • How can we combine a scheduling mechanism with the Logstash input? Or at least shut down automatically after all events have been processed (with the operating system in charge of scheduling Logstash again).

By the way, I have no problem using Logstash with another DB or middleware if it simplifies things.
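
To illustrate what I mean by "without repeating the same records", here is a rough Python sketch of the behaviour I'm after: each run queries a half-open time window that starts where the previous run stopped and ends 24 hours before now, then stores the new upper bound. The `@timestamp` field name, the checkpoint file name, and the helper names are my own assumptions; only the `range` query with `gte`/`lt` and the `sort` clause are standard Elasticsearch query DSL.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

DELAY = timedelta(hours=24)
CHECKPOINT = Path("last_processed.json")  # hypothetical file name


def load_checkpoint(path=CHECKPOINT):
    """Return the upper bound saved by the previous run, or None on first run."""
    if path.exists():
        return json.loads(path.read_text())["upper_bound"]
    return None


def save_checkpoint(upper_bound, path=CHECKPOINT):
    """Persist the upper bound so the next run starts exactly there."""
    path.write_text(json.dumps({"upper_bound": upper_bound}))


def build_window_query(lower_bound, now, field="@timestamp"):
    """Build a range query for [lower_bound, now - DELAY)."""
    upper_bound = (now - DELAY).isoformat()
    time_range = {"lt": upper_bound}
    if lower_bound is not None:
        time_range["gte"] = lower_bound
    query = {
        "query": {"range": {field: time_range}},
        "sort": [{field: "asc"}],
    }
    return query, upper_bound
```

A run would call `load_checkpoint()`, pass the result and `datetime.now(timezone.utc)` to `build_window_query`, send the query to the search API, process the hits, call `save_checkpoint(upper_bound)`, and exit, leaving it to cron (or similar) to start the next run. Because consecutive windows share a boundary with `gte` on one side and `lt` on the other, no event falls into two windows. This assumes no document is indexed later with a timestamp older than the saved bound.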

Ok, wow!
Now I see that the jdbc input plugin is much "smarter" than the Elasticsearch input plugin.
I'll check it out. And here's a GitHub issue for improving the existing Elasticsearch plugin.
