Using painless to calculate durations

(Phil) #1

I have several different indices grabbing events that have definitive start and stop points (I have an eventKey field that says "started" or "ended").

Each field also has a unique eventId field so that all things that happen in a given event have the same ID.

Is there a way via painless where if the eventKey field == "ended" I can pull the event start and end times and calculate the duration of these events?

I can reference the values of the current line (if (doc['eventKey'].value == "ended") { return doc['eventId'].value } else { return "null" }), however I cannot seem to find a way to reference the "started" event so I can do a time calculation.

Suggestions? (I am aware of the elapsed plugin, I am hoping to do this without modifying my logstash setup).

(Mark Harwood) #2

This question looks like another case of "joining related records based on a high cardinality ID"

See How to search relationship between log lines

(Phil) #3

You are definitely right, this falls into that category.

Given what that video says, I will go ahead and modify my logstash to use elapsed and reindex all my records (was hoping to avoid this, but there's really no other way).


(Mark Harwood) #4

I'm not a logstash expert but it's worth investigating the issues of robustness around that approach.

If you're relying on an in-memory cache to connect events a1 and a2 and events together then you have to ask yourself these questions:

  1. what happens when the logstash process dies with a buffer full of a1, z1 etc but no a2, z2s to complete them?
  2. what happens if a1 and a2 are very far apart in time? How much RAM do you use to keep [x]1's around waiting for the corresponding [x]2s ? What happens if you purge that RAM?
  3. What happens if one logstash process requires too much in the way of resources to do this join? How do I route events to multiple logstash workers based on ID? (I imagine this is possible but you may need to consider).

Re point 2 - a quick search suggests that by default a1 will hang around indefinitely in RAM waiting for a2 which has previously caused memory issues - see here. Again, I'm not a logstash expert but I assume you'll have to pick between risking overloading RAM or adopting a buffer-ageing policy that can potentially lose data.

The more complex architecture I outline in the video doesn't rely on fallible RAM buffers but may be more work for you to implement.

(Phil) #5

Well, it's certainly not the most robust. Yes, if an LS instance crashes that is a problem, and multiple LS instances will make sure that some events do not get tagged properly.

As far as memory goes, I run 2 CPU/4GB machines, so I am not too worried there.

I will probably have someone on my team run this data through one of our Hadoop stream processors to catch all the scenarios we need when they have time.

(Mark Harwood) #6

I was looking at an OOM report earlier with the culprit being 3GB assigned to this filter. Obviously the issue depends on how many events don't get paired but I think the default config may hold on to candidates forever waiting to pair an "end" event.

(Phil) #7

Actually, the default according to documentation is 1800 seconds (30 minutes). Mine needs to span 24 hours as some of the events I am tracking are long running.

(Mark Harwood) #8

The timeout doesn’t work unless you explicitly set ‘periodic_flush’ to true according to this:

(Phil) #9

I found a more elegant way of achieving what I want, and it was there in plain sight.

I can use the "elasticsearch" filter to query ES for my start event when I see my end event to calculate my duration. The example given on the filter page is exactly my use case.

(Mark Walkom) #10

How to calculate delta value might also be useful.

(Mark Harwood) #11

Good to know. A couple of performance concerns to check for a high throughput system:

  1. I'd be interested to see if that filter uses any "batching" under the covers to avoid many fine-grained search requests to elasticsearch. Something like msearch.
  2. Search operations will hit all shards whereas a "GET" is automatically routed to the one and only shard containing the doc with that ID.

(Phil) #12

@warkolm - I tried using painless to do that, but I was getting some very unusual results when I tried to chart the calculated number (bar chart of painless calculated vs. bar chart of logstash calculated look VERY different). It is entirely possible I missed something in the way I did it.

@Mark_Harwood - Not sure if I can make the elasticsearch filter do a GET - that would be handy. From what I see in the logs, I would say it uses some method of streamlining queries, because once I start the pipeline, I see 4 lines of this:

[2018-02-13T14:20:03,810][INFO ][logstash.filters.elasticsearch] New ElasticSearch filter client {:hosts=>["hanalytics-elasticsearch6-scus-1:9200"]}
[2018-02-13T14:20:04,118][INFO ][logstash.filters.elasticsearch] New ElasticSearch filter client {:hosts=>["hanalytics-elasticsearch6-scus-1:9200"]}
[2018-02-13T14:20:05,200][INFO ][logstash.filters.elasticsearch] New ElasticSearch filter client {:hosts=>["hanalytics-elasticsearch6-scus-1:9200"]}
[2018-02-13T14:20:05,243][INFO ][logstash.filters.elasticsearch] New ElasticSearch filter client {:hosts=>["hanalytics-elasticsearch6-scus-1:9200"]}

The only issue I see at this point is that some event durations are so short, that the start record has not been indexed so no duration gets calculated (my index refresh time is 10 seconds, I could lower this at the expense of CPU). I am adding a field to these events so I can secondarily process them if they are a high number comparatively.

(Phil) #13

I just looked at my pipeline times.

Elapsed was taking around .1 ms to execute.

The Elastic query takes about 10.5ms.

This does not bother me however since the second method works when using multiple logstash pipelines. Elapsed would not have scaled beyond a single logstash instance.

(Mark Harwood) #14

That looks like a pool of connections that can be used concurrently but the code looks to use a single request/response look up rather than a batched msearch request of any kind. So you have parallelism but using individual requests.

That is a problem. Note that the entity-centric approach I outlined offers:

  • batching of update requests
  • in-place, parallelised updates across shards
  • no issues with refresh settings (updates != searches when it comes to visibility concerns)
  • no issues with log sequencing (end log can be processed before receipt of start log)

(Phil) #15

@Mark_Harwood: I looked at that, and yes we could take advantage of that. However our use of Logstash to do this is short term. We will be moving a lot of this into Apex stream processing to monitor the stats from each active event. Once an event is complete, Apex will send the rolled up event into ES.

I'm just trying to band-aid the initial setup so some people can see some of what we're going to be able to show them stats-wise. I'm betting this band-aid might live 6-12 months.

(Mark Harwood) #16

Understood. Thanks for the feedback. It has been useful to me to re-examine the available options for solving this problem.

(Phil) #17

@Mark_Harwood: one of my team members just told me of a solution that would allow elapsed to work across multiple instances.

Since we use kafka, we can take the unique event id, and make that the key in Kafka for each message. This forces all messages with a single key into a single Kafka partition. This in turn ensures that the same logstash instance processes all messages for a given event key.

The only remaining drawback is that the elapsed filter generates one giant pile of logstash-plain.log messages (every time it's used - and with over 50 million events in the pipline, thats a LOT of extra log messages that simply say "I was used" - if I could stop logging for elapsed I'd be in business). I don't really want to add that kind of log traffic into my logging system.

(Mark Harwood) #18

One for the logstash forums I expect.