I’m looking to deliver an Elasticsearch-based search experience on top of our event-driven architecture.
We have a stream of events on Kafka. Each event represents a state change for a thing we can call a case. There are many different event types, but they can be roughly categorized into three sequential states [create -> update -> resolve]. There can be any number of updates in the lifecycle of a case; “update” is a rough abstraction over various business-defined events that don’t explicitly change the three coarse states outlined above.
The documents being indexed are JSON, and each consists of two distinct sections: a static metadata section with various identifiers and elements related to timestamps or the origin system, and a dynamic section of key/value pairs carrying the business data for the case at that event.
As the product owner, I’m looking to solve two search use cases. First: some consumers are interested in lists of cases aggregated to their current state, so a case with events create -> update -> update has a state of update. They want a keyword-style search for all cases that have the value ‘foo’ somewhere in the dynamic business information, but instead of getting the 3 events (documents) back, they should receive a single case (current state / latest event). I think a bucket aggregation may be useful here, keyed on case_id, an element all of a case’s events share with the same value. I’m not sure.
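One alternative to a bucket aggregation worth trying is Elasticsearch field collapsing, which deduplicates hits on a keyword field. Here is a sketch of such a query body (built as a plain Python dict); the field names `case_id`, `event_timestamp`, and `business_data` are assumptions standing in for your actual mapping:

```python
import json

# Assumed fields: case_id (keyword), event_timestamp (date),
# business_data (the dynamic key/value section, indexed as text).
query = {
    "query": {
        # Match 'foo' anywhere in the dynamic business section.
        "multi_match": {"query": "foo", "fields": ["business_data.*"]}
    },
    "collapse": {"field": "case_id"},       # one hit per case
    "sort": [{"event_timestamp": "desc"}],  # that hit = latest *matching* event
}
print(json.dumps(query, indent=2))
```

One caveat: collapse returns the latest event *that matched the query* per case, which is not necessarily the case’s latest event overall, so it runs straight into the overwrite problem described further down.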
Now the second case.
Some consumers are very interested in a change-log-like view. Instead of caring only about the latest event that describes the state, they want a list of all matching events (documents): a simple search for ‘foo’ yields 3 results.
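The change-log view needs nothing special on an index of raw events; it’s an ordinary event-level search. A minimal sketch, reusing the same assumed field names as above:

```python
import json

# Change-log view: return every matching event document, newest first.
# Field names (business_data.*, event_timestamp) are assumptions.
changelog_query = {
    "query": {"multi_match": {"query": "foo", "fields": ["business_data.*"]}},
    "sort": [{"event_timestamp": "desc"}],
}
print(json.dumps(changelog_query))
```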
Ok, so what’s so tricky?
It seems like I could solve this by ingesting into two indexes. One would hold documents that satisfy the aggregate view of current state: each event would trigger an update or new version of the case’s single document, so you’d look at this transformed index for all your current-state needs and be done. The other would be an index of all the ‘raw’ events.
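In the two-index design, each Kafka event fans out into two writes: an append to the raw-event index and an upsert of the current-state document keyed by case_id. A sketch of building one bulk request for both (index names `case-events` and `case-current` are made up for illustration):

```python
import json

def bulk_actions(event: dict) -> list:
    """Build the two bulk-API actions one Kafka event fans out into.
    Index names and fields here are illustrative assumptions."""
    actions = []
    # 1) Append-only raw event: auto-generated _id, one doc per event.
    actions.append({"create": {"_index": "case-events"}})
    actions.append(event)
    # 2) Current-state doc keyed by case_id: each event overwrites/merges
    #    the previous state, so this index holds one doc per case.
    actions.append({"update": {"_index": "case-current", "_id": event["case_id"]}})
    actions.append({"doc": event, "doc_as_upsert": True})
    return actions

sample = {"case_id": "C-1", "state": "update", "business_data": {"k": "foo"}}
for line in bulk_actions(sample):
    print(json.dumps(line))
```

Because both writes come from the same event in the same bulk request, the drift risk is narrower than it first appears, though out-of-order delivery still needs handling (e.g. external versioning off the event timestamp).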
I’m being asked to avoid two separate indexes out of a concern that they could drift out of sync, and that reconciling them would be difficult to impossible. To that end, a single index of ‘raw’ events joined via aggregation at query time is appealing. But alas, we’re worried about performance at query time: a single query would need to aggregate MANY events and then return MANY of these case states, even with pagination.
One further nuisance: in our create -> update1 -> update2 example, update1 may match ‘foo’ in its business data, but that value was overwritten in update2. The hit exists, but it’s not part of the current state, so it should be excluded from results in this use case.
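To make that stale-hit problem concrete, here is a plain-Python simulation (no Elasticsearch) of the create -> update1 -> update2 lifecycle. A naive event-level search finds ‘foo’ in update1, but a reduce-to-latest-then-match pass correctly excludes the case because update2 overwrote the value:

```python
# Three events for one case; 'foo' appears in update1 but is
# overwritten by update2, so the *current* state no longer matches.
events = [
    {"case_id": "C-1", "ts": 1, "state": "create", "data": {"k": "bar"}},
    {"case_id": "C-1", "ts": 2, "state": "update", "data": {"k": "foo"}},
    {"case_id": "C-1", "ts": 3, "state": "update", "data": {"k": "baz"}},
]

def naive_hits(events, term):
    """Event-level search: any event containing the term matches."""
    return [e for e in events if term in e["data"].values()]

def current_state_hits(events, term):
    """Case-level search: reduce to the latest event per case first,
    then match only against that current state."""
    latest = {}
    for e in events:
        if e["case_id"] not in latest or e["ts"] > latest[e["case_id"]]["ts"]:
            latest[e["case_id"]] = e
    return [e for e in latest.values() if term in e["data"].values()]

print(len(naive_hits(events, "foo")))          # 1 - a stale event-level hit
print(len(current_state_hits(events, "foo")))  # 0 - excluded from current state
```

The ordering matters: the query must run against the reduced current state, not the other way around. That is exactly what a precomputed current-state index gives you for free, and what a single raw-event index has to reconstruct per query (e.g. a terms aggregation on case_id with a top_hits sub-aggregation, followed by filtering the latest docs).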
There is a state engine upstream of Kafka that definitively knows the state of these cases, but it lives in a slow RDBMS and purges its closed cases for space reasons, leaving the index as the source of truth for all cases in a closed state. I don’t own the upstream architecture, I’m not sure it can be changed, and any change would land later than would be helpful to my project.
This is an enterprise-scale application, so the corpus will be large.
Even if you don’t have a complete solution, I’d still love to get your observations - I’m really looking for any and all feedback on the challenge I’m facing. My tech lead is a little stumped, and I would like to avoid buying professional services for this if possible. I hope this is an intriguing problem that holds your attention.