How to hold state in Logstash?

Hi,

I'm trying to apply a state (stored as a field in ES) to subsequent log lines after seeing [message] =~ "foo", but I have not had luck using class variables in the ruby filter (@@classVariable), as described in "Keeping global variables in LS?!", with Logstash 6.0.1.

I find that the class variable does not hold the correct value when I explore the data in Discover in Kibana. Specifically, I want the state to change when I see [message] =~ "bar" and then hold that value until [message] =~ "foo" is seen; however, the state sometimes changes value when it should not. I use a named capture from grok (which I also store in ES) to set the state when it is present, and when the state changes unexpectedly that named capture isn't even present, so something else must be changing the class variable.
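
For reference, the pattern I'm attempting is roughly the following (a simplified sketch, not my exact filter; the my_state field and the /foo/ and /bar/ matches stand in for my real grok-derived fields and patterns):

    filter {
      ruby {
        # @@in_state is a class variable, intended to persist across events
        init => "@@in_state = false"
        code => "
          @@in_state = true  if event.get('message') =~ /bar/
          @@in_state = false if event.get('message') =~ /foo/
          event.set('my_state', @@in_state)
        "
      }
    }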

It's been two years since that post, and this doesn't seem like an unreasonable thing to want, so I'm hoping there's a newer method that I haven't found.

Thanks,
Dave

A few things could affect state keeping in this way, a prime one being concurrency. How many workers is Logstash configured to spawn? If more than one, try a single worker and see whether that resolves the issue.
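
For a quick test, you can pin the worker count in logstash.yml, or pass it on the command line:

    # logstash.yml -- force a single filter worker so events flow
    # through the ruby filter strictly one at a time
    pipeline.workers: 1

(equivalently: bin/logstash -w 1 -f your-pipeline.conf)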

There are a few ways to simulate a state mechanism in Logstash, but most (if not all) have inherent problems with concurrency and out-of-order events, especially if a partitioned message queue feeds Logstash.

Is performance critical in your case? Is having multiple workers mandatory?

That makes sense; I wasn't aware of the concurrency in Logstash. I tried it with one worker and it does appear to be working correctly.

Performance is not super critical, but there is a minimum throughput I need to hit; I will have to test and determine whether the single-worker solution holds up under load.

Is there a way to use the Update API to apply the state after the fact? I think one could record each state change with its timestamp and then apply that state value to everything between those times.

I assume you're referring to Elasticsearch's Update/Update-by-query API?

Theoretically it can work: you'd keep track of state changes and timestamps and apply the updates in ES asynchronously. It would, however, add complexity to your setup and tax the ES node(s).

Update by query is not exactly a cheap operation, but it depends on how frequently you expect those states to change (updating every 5 minutes is probably fine; every 2 seconds is probably a no-go).
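
For illustration, stamping a state onto every document in a given time window could look something like this (index name, field names, state value, and timestamps are all placeholders):

    POST my-logs-*/_update_by_query
    {
      "script": {
        "source": "ctx._source.state = params.state",
        "params": { "state": "bar" }
      },
      "query": {
        "range": {
          "@timestamp": { "gte": "2018-01-01T10:00:00Z", "lt": "2018-01-01T10:05:00Z" }
        }
      }
    }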

The way I'd probably approach it is the following (depending on the availability of resources, and whether the Update API is a feasible solution or not):

  1. Have a single-worker Logstash node do only the state calculation you mentioned, on already-processed logs, and benchmark it.
  2. If the single-node, single-worker performance is acceptable, offload all other log-processing filters to a separate Logstash node. That way you won't bottleneck your entire processing pipeline on a single worker, while keeping the extra setup complexity modest.

Basically:

Event source --> Logstash pre-process node --> Logstash state-keeping node --> ES
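
The hand-off between the two nodes could be done with the plain tcp output/input plugins, for example (a sketch; the hostname and port are placeholders):

    # On the pre-process node
    output {
      tcp {
        host  => "state-keeper.internal"
        port  => 9900
        codec => json_lines
      }
    }

    # On the single-worker state-keeping node
    input {
      tcp {
        port  => 9900
        codec => json_lines
      }
    }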

Yes that's what I'm referring to.

I'm not sure I understand the pipeline you suggest. I think you're saying the work coming into the pre-process node is handled concurrently and then each event is sent to a single-threaded state-keeping node. But since events are processed concurrently in the pre-process node before being sent, doesn't that mean they can still arrive at the state-keeping node out of chronological order?

Perhaps I'm missing something about how Logstash works in this multi-node configuration. Also, if I want to keep everything on one server, does this effectively mean running two Logstash processes?

Thanks so much for your help!

Is your source in strict chronological sequence (i.e. tailing a log file)? If so, you are correct: my suggestion above will most likely introduce out-of-order events and should be disregarded.
It's only useful when you need to maintain, within Logstash, the order in which events arrive (if that makes sense).

Though there might also be a middle ground concerning your point about using a single server.
Logstash has supported multiple pipelines since 6.x, so you could theoretically split your whole process into something like four parts and make each one its own single-worker pipeline, all wrapped inside a single Logstash process.

Something like this (where each config part is a separate pipeline feeding the next step, and all are contained in a single process):

Input -> Logstash config pt.1 -> Logstash config pt.2 -> ... -> Logstash config pt.x (state-keeping) -> Output

That would essentially maintain a single worker per pipeline while letting you use more of the server's CPU cores to squeeze out a bit more performance (though all this is speculation on my part; I have not set up such a pipeline myself, so no first-hand experience).
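
As a sketch of what I mean (pipeline ids, paths, and the number of parts are placeholders; note that the pipeline input/output plugins used to wire pipelines together are a fairly recent, still-beta addition to the 6.x line, so check that your version has them):

    # pipelines.yml
    - pipeline.id: pre-process
      pipeline.workers: 4
      path.config: "/etc/logstash/pre-process.conf"
    - pipeline.id: state-keeping
      pipeline.workers: 1
      path.config: "/etc/logstash/state-keeping.conf"

    # last stage of pre-process.conf
    output { pipeline { send_to => ["state_keeping"] } }

    # first stage of state-keeping.conf
    input  { pipeline { address => "state_keeping" } }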

Yes, the source is a chronological log file. It actually isn't tailed, but rather sent all at once (the last X hours of logs).

That's an interesting idea about the pipelines, and perhaps worth benchmarking alongside the single-worker, one-config setup.

Thanks for all the suggestions!!!
