I have a quite typical architecture: Filebeat -> Logstash -> Logstash -> Elasticsearch
The first and second Logstash instances make some SQL queries to a database during processing (via a plugin).
But the problem is that I need to introduce a 10 second delay between the processing of a document in the first and in the second Logstash. So when document A has been processed in the first Logstash, the same document A should be processed in the second Logstash more than 10 seconds later.
What is the simplest way to achieve that? Maybe I should put something between these two Logstash instances? Kafka with some kind of built-in delay? Another Logstash with some delay plugin? Maybe someone has experience with this kind of problem.
Why do you need to introduce a 10 second delay? As far as I know there is no way of doing this, so understanding where the requirement comes from might help find a workaround.
As I understand how this plugin works, it does of course introduce a 10 second delay, but it will also significantly reduce the processing speed (or maybe I'm wrong?).
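For reference, I am assuming the plugin in question is something like logstash-filter-sleep, i.e. a config roughly like this (the sleep happens inside the filter stage, so it stalls the pipeline worker on every event):

```
filter {
  # logstash-filter-sleep: pause the pipeline worker for 10 seconds
  # on every event, which is why overall throughput drops so much
  sleep {
    time => "10"   # seconds to sleep
    every => 1     # apply the sleep to every event
  }
}
```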
Why do you need to introduce a 10 second delay?
Yes, I can explain.
I'm processing logs from a system. For one event in the system there are 5 types of log. These logs contain different information about the event, but they all share the same userId for the same event. So one log contains part of the information about the event, and another log contains other parts. I want to unify every log with all the knowledge that can be taken from the same event. In the first Logstash I want to do SQL inserts into a database to gather all information about the whole event for a certain user. I'm sure that after 5-10 seconds all logs for this event have arrived, so the database then contains all possible info about the event. At that point I can start processing in the second Logstash and, using a SQL select, add to my logs the additional information that was taken from the other log types. In short, I want to unify (enrich) every log with all possible information.
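To make it more concrete, here is a rough sketch of what I imagine the enrichment step in the second Logstash could look like, using a jdbc_streaming filter (the connection details, table and column names below are just made-up examples):

```
filter {
  # look up everything already collected for this event's user
  # and attach it to the current log line
  jdbc_streaming {
    jdbc_driver_library => "/path/to/postgresql-driver.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://db:5432/events"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    statement => "SELECT * FROM event_info WHERE user_id = :user_id"
    parameters => { "user_id" => "userId" }
    target => "event_info"
  }
}
```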
I'm wondering if it is possible in the Logstash ecosystem. I'm thinking about:
Adding any kind of delay to the Logstash pipeline will slow down all processing tremendously, so I do not think this is a good or viable solution. I am not sure if Kafka has any delay capabilities, so I will leave that option for others to comment on.
One option could perhaps be to create a more complex SQL statement that only extracts records that are deemed to have been completed, e.g. by not having had any new information added for 5-10 seconds. I suspect such an approach may lead to duplicates, as it may be hard to make sure all data is captured exactly once, but if you index the data with an ID built from data in the event, each reprocessed event would cause an update rather than a duplicate.
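For example, something along these lines (just a sketch; the table, columns and the "completed" condition are made up and depend entirely on your schema):

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/postgresql-driver.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://db:5432/events"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    schedule => "* * * * *"
    # only pull events that have had no new information for at least 10 seconds
    statement => "SELECT * FROM unified_events WHERE last_update < now() - interval '10 seconds'"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "events"
    # ID derived from the event itself, so reprocessing the same event
    # results in an update instead of a duplicate document
    document_id => "%{user_id}-%{event_id}"
  }
}
```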
Another approach could be to use scripted updates to incrementally add new information to the event as data comes in. This would also require the base document to have a known ID, e.g. a transaction ID and/or date, that is known to all components.
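The simplest variant of this in the elasticsearch output is a partial-document upsert via doc_as_upsert; if the merge logic needs to be smarter, the output also has script / scripted_upsert options for true scripted updates. A sketch, assuming every log type carries a transaction_id field:

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "events"
    # all log types for the same transaction target the same document
    document_id => "%{transaction_id}"
    action => "update"
    # merge this event's fields into the existing document,
    # creating it if this is the first piece to arrive
    doc_as_upsert => true
  }
}
```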
The third approach is to have a process that periodically scans Elasticsearch for newly arrived data and performs the update/assembly of events behind the scenes. This requires the development of an application/script and is not real-time, but it can be made very efficient and flexible.
All these approaches allow Logstash to work at optimal throughput.