I have several log sources (files) that are being ingested by Filebeat. For a given job there will be several (say, 5-20) log lines that are all linked by a single field (I'll call it the transaction id). Only one of those lines will contain another field of interest (e.g. a "url"). I want to be able to add that "url" field to all the log lines for that transaction id.
Here is an example log file:
2021-08-06T12:36 ABCDE - I am starting up
2021-08-06T12:37 ABCDE - I am doing something else
2021-08-06T12:38 VWXYZ - I am starting up
2021-08-06T12:39 VWXYZ - I am doing something else
2021-08-06T12:40 ABCDE - hitting url https://elastic.co
2021-08-06T12:41 VWXYZ - hitting url https://elastic.co
2021-08-06T12:42 ABCDE - I am spinning down
2021-08-06T12:42 VWXYZ - I am spinning down
In this example there are 2 "transaction ids" that correspond to 2 different jobs or threads:
- ABCDE
- VWXYZ
Filebeat would be configured to pick up the log lines and ship them off to Logstash. Logstash would then parse each line into a date (2021-08-06T12:36), a transaction id (ABCDE), and a message (I am starting up) field. For the messages that include a url (hitting url X), Logstash would grok the url out into its own url field.
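Something like this minimal Logstash filter is what I have in mind (the field names transaction_id, log_message, and url are just placeholders I'm using here):

```
filter {
  # Split each line into a timestamp, a transaction id, and the rest of the message.
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{WORD:transaction_id} - %{GREEDYDATA:log_message}" }
  }

  # For the lines that mention a url, pull it out into its own field.
  grok {
    match => { "log_message" => "hitting url %{URI:url}" }
    tag_on_failure => []
  }

  # Use the parsed timestamp as the event's @timestamp.
  date {
    match => [ "timestamp", "ISO8601" ]
  }
}
```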
Now when querying from Kibana, I can easily filter on the transaction id field to see all the messages from the same thread. I can also search for log lines whose url field is "https://elastic.co". The problem with this design is that when I filter on url, I only get these 2 log lines:
2021-08-06T12:40 ABCDE - hitting url https://elastic.co
2021-08-06T12:41 VWXYZ - hitting url https://elastic.co
If I want to see the rest of the related log lines, I have to manually copy those transaction ids into another filter.
So my question is: how can I add the url field to all the log lines that share the same transaction id?
What I have tried:
- I know this could be possible using the Logstash Aggregate Filter (rough sketch after this list), but that only works with a single worker thread, and I have multiple load-balanced Logstash servers enriching these documents in parallel.
- I could also write an external program that does a 2-pass update (find any logs that contain a url, then update every log that shares the same transaction id so it also gets the url field); a sketch of the second pass is below. The downside is that it would not happen on ingest, and it could quickly become an expensive query depending on how many logs are in Elasticsearch.
- I also looked at moving some of the parsing logic off of Logstash and into Filebeat so I could use Filebeat's Script Processor, but that only lets you manipulate the current event, not past events. E.g. I could keep a lookup table in the script processor that maps transaction id to url (this would obviously need some management to prevent it from growing indefinitely), and when a url shows up, add it to the table. New events for that transaction id would then get the url field added, but this wouldn't help the earlier events that were already processed before the url was known. A sketch of that idea is also below.
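For reference, the kind of aggregate filter I had in mind looks roughly like this (reusing the transaction_id/url field names from the grok sketch above). Note that it still only back-fills lines that arrive after the url line, and it only works with pipeline.workers: 1 on a single Logstash node, since the map lives in that process's memory:

```
filter {
  aggregate {
    task_id => "%{transaction_id}"
    code => "
      map['url'] ||= event.get('url')               # remember the url once a line contains it
      event.set('url', map['url']) if map['url']    # copy it onto this and subsequent lines
    "
    timeout => 300                                  # forget a transaction's state after 5 minutes
  }
}
```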
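The second pass of the external-program idea would boil down to one _update_by_query per transaction id that has a url, something like the following (the index pattern and field names are placeholders, and transaction_id would need to be a keyword field for the term query):

```
POST my-logs-*/_update_by_query
{
  "query": {
    "bool": {
      "filter": [ { "term": { "transaction_id": "ABCDE" } } ],
      "must_not": [ { "exists": { "field": "url" } } ]
    }
  },
  "script": {
    "source": "ctx._source.url = params.url",
    "params": { "url": "https://elastic.co" }
  }
}
```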
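And this is roughly the script processor idea, doing the transaction id / url extraction in JavaScript inside filebeat.yml. It assumes the top-level urlByTxn object survives across events within the processor, which is the part I'm not sure about, and it still can't fix events that were shipped before the url appeared:

```yaml
processors:
  - script:
      lang: javascript
      source: |
        // Lookup table mapping transaction id -> url.
        // Assumes this top-level state persists between process() calls.
        var urlByTxn = {};

        function process(event) {
          var msg = event.Get("message");
          if (!msg) return;

          var txn = msg.split(" ")[1];          // e.g. "ABCDE"
          event.Put("transaction_id", txn);

          var m = msg.match(/hitting url (\S+)/);
          if (m) {
            urlByTxn[txn] = m[1];               // remember the url for this transaction
            event.Put("url", m[1]);
          } else if (urlByTxn[txn]) {
            event.Put("url", urlByTxn[txn]);    // back-fill lines that arrive after the url
          }
        }
```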
These log lines are also written fairly close together in time, so a solution wouldn't need to hold events back indefinitely; if for some reason no url ever shows up for a transaction id, it would be fine for those lines to be ingested without it.