Continuous document?

Hello,

Summary: I am trying to find a way to have many "objects" and query their properties at certain points in time.

I'm having a problem implementing ELK and I need some advice.
I have a database, part of which is documented defects, and I am using Logstash with the JDBC input to pull these defects into Elasticsearch. They are constantly changing and being updated (opened, closed, reopened, etc.).
So far so good: I can schedule Logstash to run every day and pull only the updates, so I end up with many defect documents, some of which share the same defect ID, each containing the defect's properties and the time it was modified. But now I need to go back and query the state of all defects at a certain point in time (how many were open, how many were closed, etc.). I can't seem to find a good way to do this, since a document is a point in time rather than a property of a certain object.
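
For example, a defect's history ends up as separate documents, roughly like this (the field names here are just for illustration):

```python
# One document per change to a defect; the same defect ID shows up on many documents.
defect_change = {
    "defect_id": "DEF-1234",                 # identifies the defect, repeated across changes
    "status": "reopened",                    # opened / closed / reopened, etc.
    "severity": "major",                     # other properties of the defect at that moment
    "modified_at": "2016-05-10T08:30:00Z",   # when this change happened
}
```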
I really need advice on how to map/arrange/filter my data in such a way that this will be possible.

Hopefully I made myself clear enough.
Thank you very much,
aviv

Elasticsearch doesn't have a time machine style function. It'd be neat to think about implementing one but it'd be a big project!

The simplest thing may be to run the reports when you want them and store them.

You could try a solution where each change to a defect is reflected in a new document. If you did that you could write a custom aggregation that gets you the "latest change <= some date" change per defect. That is a reasonably advanced thing to do too but if you had it you could run aggregations on those latest changes. You couldn't do anything else with them though.
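
To make the shape of that concrete, here is a rough sketch using the built-in terms and top_hits aggregations with the Python client (the index and field names are assumptions). It gets you the latest change per defect as of a date, but top_hits results can't feed further aggregations, so the counting has to happen client side; that's the part a custom aggregation would do on the server:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Assumed index/field names: "defects", "defect_id", "modified_at", "status".
# defect_id needs to be a not_analyzed / keyword field for the terms agg to work.
body = {
    "size": 0,
    "query": {"range": {"modified_at": {"lte": "2016-05-01"}}},  # "some date"
    "aggs": {
        "per_defect": {
            "terms": {"field": "defect_id", "size": 10000},       # enough buckets for all defects
            "aggs": {
                "latest_change": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"modified_at": {"order": "desc"}}],
                    }
                }
            },
        }
    },
}

response = es.search(index="defects", body=body)

# top_hits output can't be aggregated further, so count statuses client side.
counts = {}
for bucket in response["aggregations"]["per_defect"]["buckets"]:
    latest = bucket["latest_change"]["hits"]["hits"][0]["_source"]
    counts[latest["status"]] = counts.get(latest["status"], 0) + 1
print(counts)  # e.g. {"open": 12, "closed": 40}
```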

You could reindex the whole index into a new index every day or so. Then you could do whatever you wanted with the historical index that you made. But that seems like a pretty heavy solution because you'll be copying the state of the old indexes.
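
A rough sketch of that kind of daily snapshot, assuming a version with the Reindex API and an index called "defects":

```python
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Copy the current contents of "defects" into a dated snapshot index,
# e.g. "defects-2016-05-10", which you can then query as historical state.
snapshot = "defects-%s" % date.today().isoformat()
es.reindex(body={"source": {"index": "defects"}, "dest": {"index": snapshot}})
```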

Thank you very much, Nik, for your response :slight_smile:

If you could elaborate on your second suggestion, that would be great.
Currently I create a whole new document for each change. What you describe would require finding, for each defect ID, the document with the most recent timestamp (or else all of the documents with that ID will fall into the bucket).
If you think that's possible, it would be great if you could point me in the right direction as to how to achieve this technically.

You mean "run the reports when you want them and store them"? I just mean kick the reports off via cron or something at regular intervals, store them, and use the stored reports. You can totally store them in elasticsearch and aggregate on the report results if you want.

Should have been more specific, I meant this:

You could try a solution where each change to a defect is reflected in a new document. If you did that you could write a custom aggregation that gets you the "latest change <= some date" change per defect. That is a reasonably advanced thing to do too but if you had it you could run aggregations on those latest changes. You couldn't do anything else with them though.

Sure!

This means an Elasticsearch plugin that registers a new aggregation.

Just building a plugin is more than most folks do and this one is a bit tricky. Problems you'd have to solve are:

  1. Documents are routed based on some hash of their ${type}:${id}, so if you have a couple of documents "about" a defect, you'd want to make sure they all get routed to the same shard (and therefore the same node). You'd do this in your application whenever you interact with Elasticsearch; see the sketch after this list.
  2. Iteration order is pretty random, so you'd need a lot of scratch space for the documents and their dates. Elasticsearch has an abstraction called BigArrays for this. It's how you integrate with the circuit breaker to make sure you don't consume too much heap, and it lowers the allocation rate at the cost of making long-lived objects.
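
For the routing point above, a minimal sketch (index and field names are assumptions) is to pass the defect's ID as the routing value every time you index or search, so all changes for one defect land on the same shard:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

change = {"defect_id": "DEF-1234", "status": "closed",
          "modified_at": "2016-05-10T08:30:00Z"}

# Route every document about a defect by the defect's ID so they all
# end up on the same shard.
es.index(index="defects", doc_type="change",
         routing=change["defect_id"], body=change)

# Use the same routing value when querying for that defect.
es.search(index="defects", routing="DEF-1234",
          body={"query": {"term": {"defect_id": "DEF-1234"}}})
```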

I don't know the aggregations code super well so I'm sure there are other things I've not thought about. I imagine this is a decent sized project.

I see. Thank you very much, Nik, for your great replies!!
Have a great week.