How to aggregate based on latest value?

Hi,

I am trying to understand if I can do the following aggregation.
Let's say I have the following documents:

{
	"_id": "auy6a25C7ofiYFJ1f",
	"my_id": 1,
	"type": "A",
	"timestamp": "2018-10-27 10:00"
},
{
	"_id": "leYdh6zELBawR1iaK7",
	"my_id": 1,
	"type": "B",
	"timestamp": "2018-10-28 10:00"
},
{
	"_id": "I96fxaKgVwDElW2vTx",
	"my_id": 1,
	"type": "C",
	"timestamp": "2018-10-29 10:00"
},
{
	"_id": "niaM0ncHot0JDJL4BH",
	"my_id": 2,
	"type": "A",
	"timestamp": "2018-10-27 10:00"
},
{
	"_id": "K976Y3B9yTWs4jPavm",
	"my_id": 2,
	"type": "B",
	"timestamp": "2018-10-28 10:00"
}

I want to create an aggregation on type, but with the context of the my_id field.
So if I have 2 ids, I want the aggregation to consider that when my_id is 1 its type is "C" (the latest) and when my_id is 2 its type is "B" (the latest).
This is based on the latest timestamp for each element.

In the end, I want to be able to get the following result:

"A": 0
"B": 2
"C": 3

Is this possible?
Thanks!!!!!

How many unique ids have you got?

@Mark_Harwood On average it should be 500K, but it can get to 1 to 5 million unique ids.

Tricky. This is the classic "bucket explosion" problem when trying to do behavioural analytics on raw log data. Likely you'll need to create an "entity centric" index to sit alongside your event store.

See this video for background and example scripts and data.
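
To make the entity-centric idea concrete, here's a minimal Python sketch (not Mark's actual scripts; the sample docs are just the ones from the question) that collapses the raw event docs into one entity doc per my_id carrying the latest type:

```python
# Entity-centric rollup: collapse raw event docs into one doc per my_id
# that carries the type of that id's latest event.
events = [
    {"my_id": 1, "type": "A", "timestamp": "2018-10-27 10:00"},
    {"my_id": 1, "type": "B", "timestamp": "2018-10-28 10:00"},
    {"my_id": 1, "type": "C", "timestamp": "2018-10-29 10:00"},
    {"my_id": 2, "type": "A", "timestamp": "2018-10-27 10:00"},
    {"my_id": 2, "type": "B", "timestamp": "2018-10-28 10:00"},
]

def build_entities(events):
    """Keep, per my_id, the event with the greatest timestamp.
    Lexicographic comparison is safe for zero-padded 'YYYY-MM-DD HH:MM'."""
    latest = {}
    for e in events:
        cur = latest.get(e["my_id"])
        if cur is None or e["timestamp"] > cur["timestamp"]:
            latest[e["my_id"]] = e
    return [{"my_id": i, "type": e["type"], "lastSeen": e["timestamp"]}
            for i, e in sorted(latest.items())]

entities = build_entities(events)
# entities -> [{'my_id': 1, 'type': 'C', 'lastSeen': '2018-10-29 10:00'},
#              {'my_id': 2, 'type': 'B', 'lastSeen': '2018-10-28 10:00'}]
```

In practice this rollup would run as a periodic batch (or update-by-query) that writes the collapsed docs into a second index.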

@Mark_Harwood This is super cool! And it might work for me.
There is still something missing, I think.
If I go with this approach, I will end up with 2 indexes: one for my logs, and one for my entities.

So one index will have:

{
	"_id": "auy6a25C7ofiYFJ1f",
	"my_id": 1,
	"type": "A",
	"timestamp": "2018-10-27 10:00"
},
{
	"_id": "leYdh6zELBawR1iaK7",
	"my_id": 1,
	"type": "B",
	"timestamp": "2018-10-28 10:00"
},
{
	"_id": "I96fxaKgVwDElW2vTx",
	"my_id": 1,
	"type": "C",
	"timestamp": "2018-10-29 10:00"
},
{
	"_id": "niaM0ncHot0JDJL4BH",
	"my_id": 2,
	"type": "A",
	"timestamp": "2018-10-27 10:00"
},
{
	"_id": "K976Y3B9yTWs4jPavm",
	"my_id": 2,
	"type": "B",
	"timestamp": "2018-10-28 10:00"
}

And the other will have:

{
    "my_id": 1,
    "type": "C"
},
{
    "my_id": 2,
    "type": "B"
}

Which is cool. But then how can I do the following:
filter logs based on a timestamp query, for example, and then run the aggregation by type only on the entities which have been seen in that window? (Essentially I need some connection between the 2 indexes.)

So if I want to filter all logs whose timestamp is greater than or equal to "2018-10-28 10:00" and then group by entity type.
The end result should be:

"B": 2
"C": 2

What do you think?

Not sure exactly what business problem you're trying to solve, but you don't need to keep just one "type" attribute. You could hold "lastSeen" and "firstSeen" dates, "minutesInStateB" durations, etc. if they are of use. If I understand your problem correctly, you'd want to summarise entity types over those entities with a "lastSeen" date greater than or equal to "2018-10-28 10:00".

GET entities/_search
{
  "query": {
	"range": {
	  "lastSeen": {
		"gte":"2018-10-28 10:00"
	  }
	}
  },
  "aggs":{
	"types":{
	  "terms":{
		"field":"type"
	  }
	}
  }
}

Think of it as an IoT device. I have logs and an IoT device entity.
I want to query for all the logs between date x and date y, then take all the relevant devices and group by their type.

In SQL it would be:

SELECT d.type, count(*) FROM log as l,device as d 
WHERE  d.device_id = l.device_id
AND l.timestamp > '2018-07-20 10:00'
AND l.timestamp <  '2018-07-21 10:00'
GROUP BY d.type
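
For reference, here is a small Python simulation of what that SQL computes (the device ids, types, and log rows are made-up sample data, not from the thread):

```python
from collections import Counter

# Made-up sample rows standing in for the log and device tables.
logs = [
    {"device_id": 1, "timestamp": "2018-07-20 11:00"},
    {"device_id": 1, "timestamp": "2018-07-20 12:00"},
    {"device_id": 2, "timestamp": "2018-07-20 13:00"},
    {"device_id": 3, "timestamp": "2018-07-25 09:00"},  # outside the window
]
devices = {1: "C", 2: "B", 3: "A"}  # device_id -> type

def count_types(logs, devices, start, end):
    """Join each in-window log row to its device and count rows per
    device type, mirroring the WHERE and GROUP BY clauses above."""
    counts = Counter()
    for row in logs:
        if start < row["timestamp"] < end:
            counts[devices[row["device_id"]]] += 1
    return dict(counts)

result = count_types(logs, devices, "2018-07-20 10:00", "2018-07-21 10:00")
# result -> {'C': 2, 'B': 1}
```

Note this counts log rows per type; counting distinct devices per type instead would need a COUNT(DISTINCT d.device_id), which is the variant discussed below.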

Going with your suggestion, I might have a field on the device entity with all the timestamps at which it was seen; that can get up to a couple of thousand values. Then I can do:

GET entities/_search
{
  "query": {
	"range": {
	  "lastSeen": {
		"gte": "2018-10-28 10:00",
		"lte": "2018-10-29 10:00"
	  }
	}
  },
  "aggs":{
	"types":{
	  "terms":{
		"field":"type"
	  }
	}
  }
}

And it might work. But as soon as I expand my filter on the logs it becomes harder, right?
I mean, if I want to do the following:
query for all the logs of log_type "x" with severity equal to 7 between date x and date y, then take all the relevant devices and group by their type (device_type).

In SQL it will be:

SELECT d.type, count(*) FROM log as l,device as d 
WHERE  d.device_id = l.device_id
AND l.log_type = 'X'
AND l.severity = 7
AND l.timestamp > '2018-07-20 10:00'
AND l.timestamp <  '2018-07-21 10:00'
GROUP BY d.type

How would I do that?

The key bit of your question is the type, which changes over time, so there are a number of open questions here. When you say find "all the logs of type X" and "group by type", you need to qualify, for each of these expressions, whether you mean the type as at:
a) the time of the sev 7 log
b) date Y
c) now

So I want all the logs of type X, and then a group by on the device type.
I assume the most reasonable option would be option b, which is the type of the device on date y.
The idea is to do the actual group by on the latest type in that time window, which is option b.

If this will be too hard to implement, I can go with option c which is now.
But option b is what I need.

Any idea on how to implement that?

Does it mean I need to turn every timestamp term into a complex object of timestamp and device_type?

Option b may be hard if date Y is any random choice of time value.

How many events per device? If they are smallish in number, you could use nested event logs inside a device entity, so that your sev-7 and date-x-to-y criteria could efficiently test properties of the nested event objects but report back a match on the containing device and its lastSeen type.
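
To illustrate the nested idea, here is a Python simulation of the matching logic (the device docs and their events are invented sample data): a device matches if any one of its nested events satisfies all the log criteria together, and matching devices are then counted by their lastSeen type.

```python
from collections import Counter

# Hypothetical device entities with their events nested inside.
devices = [
    {"device_id": 1, "type": "C",
     "events": [{"log_type": "X", "severity": 7, "timestamp": "2018-07-20 11:00"},
                {"log_type": "Y", "severity": 3, "timestamp": "2018-07-20 12:00"}]},
    {"device_id": 2, "type": "B",
     "events": [{"log_type": "X", "severity": 2, "timestamp": "2018-07-20 13:00"}]},
]

def types_of_matching_devices(devices, start, end):
    """A device matches if ANY nested event is log_type X, severity 7,
    and inside the window; count matching devices per type. This is what
    a nested query gives you that flattened arrays do not: the criteria
    must all hold on the SAME event object."""
    counts = Counter()
    for d in devices:
        if any(e["log_type"] == "X" and e["severity"] == 7
               and start < e["timestamp"] < end for e in d["events"]):
            counts[d["type"]] += 1
    return dict(counts)

result = types_of_matching_devices(devices, "2018-07-20 10:00", "2018-07-21 10:00")
# result -> {'C': 1}
```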

@Mark_Harwood I might have 400K events per device, and more, but an average of 10K. Sounds big.. :expressionless:
Any more tricks we can do? I am dying to make this happen!

yeah.

I expect your app will have to do 2 query passes: one on the event store to get the bad device IDs, then one or more queries (depending on the volume of ids) to get the latest device types for those IDs on or before date Y.
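
A sketch of those two passes against in-memory data (field names are assumed from the earlier SQL examples; against a real cluster each pass would be a search request):

```python
from collections import Counter

# Made-up sample events; fields follow the SQL example above.
events = [
    {"device_id": 1, "type": "A", "log_type": "X", "severity": 7, "timestamp": "2018-07-20 11:00"},
    {"device_id": 1, "type": "C", "log_type": "Y", "severity": 3, "timestamp": "2018-07-20 15:00"},
    {"device_id": 2, "type": "B", "log_type": "X", "severity": 7, "timestamp": "2018-07-20 13:00"},
    {"device_id": 3, "type": "A", "log_type": "X", "severity": 2, "timestamp": "2018-07-20 14:00"},
]

def pass_one(events, start, end):
    """Pass 1: hit the event store for matching logs, collect device ids."""
    return {e["device_id"] for e in events
            if e["log_type"] == "X" and e["severity"] == 7
            and start < e["timestamp"] < end}

def pass_two(events, ids, end):
    """Pass 2: for each collected id, take the type of its latest event
    on or before date Y, then count ids per type."""
    latest = {}
    for e in events:
        if e["device_id"] in ids and e["timestamp"] <= end:
            cur = latest.get(e["device_id"])
            if cur is None or e["timestamp"] > cur["timestamp"]:
                latest[e["device_id"]] = e
    return dict(Counter(e["type"] for e in latest.values()))

ids = pass_one(events, "2018-07-20 10:00", "2018-07-21 10:00")
result = pass_two(events, ids, "2018-07-21 10:00")
# result -> {'C': 1, 'B': 1}
```

With hundreds of thousands of ids, pass 2 would need to be chunked into several terms queries, which is the sizing question discussed next.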

I see. Hmm, this might work.
Just so I understand my options: how many ids can be used in one query?
Can I get a list of 200K ids and query the other index with it?

That's a number you'll have to experiment with, I'm afraid. It will depend on how many shards you have, and there are limits on how many "buckets" an aggregation can create. For each of your IDs you'll need to get the type of the event with the max date (last sighting) within the query's date range.
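
One way to express "type of the max date per id" in a single request, assuming an event index named logs with the fields from the examples above, is a terms aggregation on my_id with a top_hits sub-aggregation sorted by timestamp descending. This is a sketch only, and the terms size here is exactly where the bucket limits mentioned above bite:

```
GET logs/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "2018-10-28 10:00",
        "lte": "2018-10-29 10:00"
      }
    }
  },
  "aggs": {
    "per_id": {
      "terms": { "field": "my_id", "size": 10000 },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "_source": [ "type" ]
          }
        }
      }
    }
  }
}
```

This returns the latest type per id within the window, but with millions of ids it creates one bucket per id, which is the "bucket explosion" problem, hence the entity-centric index.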

@Mark_Harwood That might work (though maybe not optimally) for search, but I cannot do the group by this way.
Does this mean I will need to do the group by at the application level?

Also, if I need the latest type overall, and not the one that is latest in that time window, I will need to do the same, right? Using the latest won't help in any way?

Yep. Unless you hit an entity-centric index for the second part of the query where you only store the "lastSeen" type. In that case you'd only return aggregate counts for each type (whereas the event-centric store would more verbosely give you the latest types for each id). Either way, with multiple requests the client app will have to do some of this grouping.

I see. Thanks!
