Elasticsearch for real-time metrics/KPIs of a system

Hi! I need some advice.

This is a very simplified version of what I have.

I've developed an API that returns metrics about things that happen on my system: for example, how long a user has been logged in, how many times they log in, how long they use a certain feature, and so on.

My system sends events to RabbitMQ, and I have a service that listens for those events and simply adds them to an index. The events are raw; they just indicate things like "User A logged in at 3pm", "User A logged out at 4pm", "User A entered the monitoring section at 3:15pm".

My API then performs calculations over those index entries locally. To extract some metrics, we sometimes run searches with aggregations.

The user can request information within a time range (for example, the maximum time a user was logged in last year), or they might simply want to know whether a user is currently logged in.

We are now facing performance issues. There are a lot of entries in the indices, and querying is getting expensive both in Elasticsearch and in the API.

The first problem is that I might have 100 users watching a dashboard, which produces 100 requests that end up running the exact same Elasticsearch query 100 times. I've Googled a bit but haven't found a cache mechanism that can detect that the request is exactly the same. Do you know of something like that? On the other hand, using caches would lose the real-time nature of the information. So, any suggestion to overcome this problem? Is there a way for Elasticsearch to see that an identical query is already being executed and not run it again?
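To make this concrete: one of the queries every dashboard fires is a "latest event" lookup to tell whether a user is currently logged in. A simplified sketch (the index and field names here are placeholders, not our real mapping):

// fetch the newest event for one user and inspect its state
GET /events/_search
{
  "size": 1,
  "query": { "term": { "userID": "A" } },
  "sort": [ { "@timestamp": { "order": "desc" } } ]
}

With 100 dashboards open, this exact request body gets executed 100 times.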

The second problem is that sometimes I have to retrieve all the documents in one or more indices to perform calculations over them. For example, to extract the logged-in time, I have to read all the documents for that user in the specified time range and sum the durations. What is your advice on this? Should I keep some kind of "snapshot" with precalculated values? Can I do this kind of calculation during the search?

Thank you in advance for the help.

Elasticsearch does cache aggregation results automatically, so there is something in place. You might just be overloading your cluster?

Take a look at rollups or transforms - Roll up or transform your data | Elasticsearch Guide [8.11] | Elastic.

Thanks for replying.
I've noticed that there is a cache, but these are real-time events and we are continuously adding new ones to the indices. As we add new documents, the aggregations need to be recalculated every time, right? So in this case I think the cache won't help.

What do you mean by overloading?

I will take a look at the rollup feature. I thought it was related to ILM with Data Streams (I was thinking of rollover), but it seems to be another thing.

Thanks

Only for the new documents.
The cache works at the segment level, FWIW.

Am I getting this right: all your queries are centered around a user ID? In that case a so-called entity-centric index built using a transform might indeed be something to look at. A transform can build a self-updating view of your data and pre-compute the metrics you are looking for. As said, this builds a view: even if that view can't answer everything, it can answer a significant share, and you still have the source data available to answer the more complicated questions.
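As a rough sketch (the transform id, index names, and field names are invented here; adapt them to your mapping), a continuous pivot transform grouped by user could look like this:

// "events" is the source index with the raw documents,
// "user-metrics" the entity-centric destination index
PUT _transform/user-metrics
{
  "source": { "index": "events" },
  "dest": { "index": "user-metrics" },
  "pivot": {
    "group_by": {
      "userID": { "terms": { "field": "userID" } }
    },
    "aggregations": {
      "lastSeen": { "max": { "field": "@timestamp" } },
      "totalEvents": { "value_count": { "field": "@timestamp" } }
    }
  },
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
  "frequency": "1m"
}

After a POST _transform/user-metrics/_start, the destination index keeps itself up to date as new events arrive.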


I think I'll give you an example.
My system sends login events like this:

"@timestamp": "2021-01-11T07:15:41",
"agentID": 3,
"loggedState": 1001,
"previousStateDateTime": "2021-01-11T07:12:26",
"sequenceNumber": 2235,
"sequenceID": 100000007

I need to calculate the total logged-in time for a certain period. I can easily do this with a script field that computes "@timestamp - previousStateDateTime" and then aggregate with a sum, for example.
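Roughly like this (simplified from the real query; the index name is made up):

// sum the per-event durations computed at query time
GET /login-events/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2021-01-11T00:00:00",
        "lte": "2021-01-11T23:59:59"
      }
    }
  },
  "aggs": {
    "loggedInMillis": {
      "sum": {
        "script": {
          "lang": "painless",
          "source": "doc['@timestamp'].value.toInstant().toEpochMilli() - doc['previousStateDateTime'].value.toInstant().toEpochMilli()"
        }
      }
    }
  }
}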

The problem is that an agent can log in at 2021-01-10T20:00:00 and log out at 2021-01-11T07:00:00. If I filter the data from 2021-01-11T00:00:00 to 2021-01-11T23:59:59, the scripted field will give me a login time of 11 hours, but in fact the login time should only be 7h (from 00:00:00 to 07:00:00).

This kind of ruins my calculations :(.

Thanks for the example. What a transform can provide is continuously querying the source index in order to create one document per entity, e.g. per agent. However, since I guess an agent can log in and out multiple times, you probably need another criterion. Do you have a field for some sort of session ID? If so, I would group by agent and session ID.

In the end you can create something like:

"agentID": 3,
"timeLoggedIn": 42,
"lastSeen": "2021-01-11T07:12:26"
"totalEvents": 99,
...

Transforms support a lot of aggregations, so you can re-use whatever you already have. The benefit is an index with pre-computed data: a search on that index will be faster, especially if you repeat the query often.

When I say a transform continuously queries the source index, note that it does so in a smart way: it only updates changed session information, it won't recompute everything all the time. To some extent, setting up a transform can be seen as a way to cache repeated aggregations. However, if you take it further, you can run secondary analysis, like calculating the average session time over all agents, which is only possible by pivoting the data with a transform first.
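For example, once the pivoted index exists, that secondary question becomes a trivial aggregation over one document per agent (assuming the transform's destination index is called agent-sessions and has the timeLoggedIn field from the sketch above):

GET /agent-sessions/_search
{
  "size": 0,
  "aggs": {
    "avgSessionTime": { "avg": { "field": "timeLoggedIn" } }
  }
}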

I see what you mean. Unfortunately, I think it's not that simple :(.
In my last example, there is a problem with the logins. For example, the last entry might be a login at 2021-01-11T17:00:00 with no logout document in that time range. This means the agent was still logged in when I queried Elasticsearch.

So, if the agent is currently logged in, every time I hit refresh (or run the query) I should see "timeLoggedIn" increasing... like a real-time view of how long this agent has been logged in.

I'm not seeing a way of doing this. I'm using a scripted metric aggregation to try to work around it, but my main concern is that I'm overloading Elasticsearch with this when I could simply do it in the backend system that reads from Elasticsearch.

Does your system send events at regular intervals? If so, you could calculate an intermediate session time for the case where you haven't received a logged-out event, the same way you do it right now in your query. However, this requires that your system sends some sort of heartbeat.

The other option: you put a firstSeen field into the transform and calculate the session time in your query or as part of your backend. This is still fundamentally cheaper than a search over lots of documents.
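In transform terms, firstSeen is just one more aggregation in the pivot, e.g.:

"aggregations": {
  "firstSeen": { "min": { "field": "@timestamp" } },
  "lastSeen": { "max": { "field": "@timestamp" } }
}

For a session without a logged-out event, your backend can then report now - firstSeen as the intermediate session time.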

