I get the general principle of Mark Harwood's entity-centric indexing pattern, and I've written some code, and it works.
The problem I've got is around the "periodic extracts sorted by entity ID and time" - how do you do this so that you (a) don't miss any data and (b) don't burn resources processing the same data over and over again?
Scenario: log entries get written to a log file, picked up by Filebeat, parsed by Logstash, stored in Elasticsearch. Then there's a Python script (loosely based on Mark's example code) to search for relevant new log records and create/update the entity-centric documents.
Fine, but how do I search for "new" log records? At present I'm running the script once a minute, because that seems a reasonable maximum time for our Ops people to have to wait before being told something is broken. The script looks at all documents less than an hour old, based on @timestamp, because I've seen a Filebeat instance get an hour behind with its processing before now.
The problems with this are obvious:
- if a Filebeat instance were to get more than an hour behind there would be some records that the script would never look at
- because I'm scanning an hour's worth of log documents every minute I'm looking at each relevant one 60 times, which means doing 60 times as much work as necessary.
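To make the second point concrete, here is a minimal sketch of the query this setup amounts to, plus the arithmetic behind the factor of 60. The function names are made up for illustration and this is not the actual script; it only builds the request body, so passing it to a client is left out.

```python
def build_lookback_query(lookback_minutes: int, field: str = "@timestamp") -> dict:
    """Build a search body matching documents whose `field` is newer than
    now minus `lookback_minutes` (Elasticsearch date math, e.g. "now-60m")."""
    return {
        "query": {
            "range": {
                field: {"gte": f"now-{lookback_minutes}m"}
            }
        }
    }


def passes_per_document(lookback_minutes: int, poll_interval_minutes: int) -> int:
    """How many polling runs will see the same document: the lookback window
    divided by the polling interval, rounded up."""
    return -(-lookback_minutes // poll_interval_minutes)


# Hypothetical values matching the description: 60-minute window, 1-minute poll.
query = build_lookback_query(60)
repeats = passes_per_document(60, 1)
```

With a 60-minute window and a 1-minute interval, `repeats` is 60, which is where the "60 times as much work" figure comes from.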
I can't see any way to (a) guarantee that all new log documents do get looked at by the script and (b) not waste resources repeatedly processing the same data.
One thing that has occurred to me is to stop using @timestamp and instead use an ingest pipeline to generate an ingest timestamp and use that. I can then guess that it will almost never take more than a minute from the ingest timestamp until the document is actually indexed, and make my script look back two minutes based on the ingest timestamp instead of 60 minutes based on @timestamp.
This would, I think, reduce both the chance of missing data and the amount of duplicated work, but neither of them to zero.
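A rough sketch of what I mean, assuming a `set` processor that copies `_ingest.timestamp` into a field. The field name `ingest_timestamp` and the function names are placeholders I've picked for illustration, and the two-minute window is the guess described above, not a measured bound. The pipeline body would still need to be registered in Elasticsearch and applied to the incoming documents.

```python
def build_ingest_timestamp_pipeline(field: str = "ingest_timestamp") -> dict:
    """Ingest pipeline body that stamps each document with the time the
    ingest node processed it (`field` is a hypothetical name)."""
    return {
        "description": "Record the time each document passed through ingest",
        "processors": [
            {"set": {"field": field, "value": "{{{_ingest.timestamp}}}"}}
        ],
    }


def build_recent_ingest_query(lookback_minutes: int = 2,
                              field: str = "ingest_timestamp") -> dict:
    """Search body for documents ingested within the last `lookback_minutes`,
    oldest first so entity updates are applied in arrival order."""
    return {
        "query": {
            "range": {
                field: {"gte": f"now-{lookback_minutes}m"}
            }
        },
        "sort": [{field: "asc"}],
    }


pipeline_body = build_ingest_timestamp_pipeline()
search_body = build_recent_ingest_query()
```

Run once a minute with a two-minute window, each document would be seen about twice rather than 60 times, but anything that takes longer than two minutes to become searchable would still be missed.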
What have I missed? Or is that really the best that can be done?