I'm trying to determine if there is a way that I can query an Elasticsearch index for a particular data anomaly we sometimes encounter.
Here is a contrived example of the dataset (with some annotations):
```
[
  {
    "_id": 1,
    "asset": "foo",
    "recorded_date": "2020-09-11T00:00:00.000Z",  # start the "clock"
    "recorded_ts": 1599782400,
    "state": "on"
  },
  {
    "_id": 2,
    "asset": "foo",
    "recorded_date": "2020-09-11T02:00:00.000Z",  # 2 hours <= 3 hours (ignored)
    "recorded_ts": 1599789600,
    "state": "off"
  },
  {
    "_id": 3,
    "asset": "foo",
    "recorded_date": "2020-09-11T04:00:00.000Z",  # start the "clock"
    "recorded_ts": 1599796800,
    "state": "on"
  },
  {
    "_id": 4,
    "asset": "baz",
    "recorded_date": "2020-09-11T08:30:00.000Z",  # baz != foo
    "recorded_ts": 1599813000,
    "state": "off"
  },
  {
    "_id": 5,
    "asset": "foo",
    "recorded_date": "2020-09-11T14:30:00.000Z",  # 10.5 hours > 3 hours (anomaly detected!)
    "recorded_ts": 1599834600,
    "state": "off"
  },
  {
    "_id": 6,
    "asset": "foo",
    "recorded_date": "2020-09-11T20:00:00.000Z",  # immediately-preceding foo state != on (ignored)
    "recorded_ts": 1599854400,
    "state": "booting"
  },
  {
    "_id": 7,
    "asset": "foo",
    "recorded_date": "2020-09-11T23:30:00.000Z",  # immediately-preceding foo state != on (ignored)
    "recorded_ts": 1599867000,
    "state": "off"
  }
]
```
In a nutshell, I want to count how frequently during a specific time range (e.g. `2020-09-11T00:00:00Z` to `2020-09-12T00:00:00Z`) a particular asset (e.g. `foo`) has a `state` equal to `on` for more than `3` hours. (Imagine `foo` as a widget that could potentially overheat if it stays `on` longer than `3` hours.) Given the sample dataset above, the query would detect `1` anomaly for asset `foo`.
In thinking about the problem space, I've come up with some criteria in an attempt to describe what constitutes an anomaly:

1. The current document must not have a `state` equal to `on`.
2. The current document must have a sequentially preceding document within the queried `recorded_ts` period, with the same `asset` value and a `state` equal to `on`.
   a. The preceding document and the current document could be non-consecutive in the dataset (i.e. records for other assets may be interleaved).
   b. Any preceding document outside of the `recorded_ts` period bounds should simply be ignored, and not count towards any anomaly counters.
3. The difference between the `recorded_ts` fields of the current document and the preceding document must be more than `3` hours.
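To make the criteria concrete, here is a small client-side sketch in Python (this is not a query DSL solution; `count_anomalies` and the threshold constant are hypothetical names, and the documents are assumed to have already been fetched from the index):

```python
THRESHOLD_SECONDS = 3 * 3600  # "more than 3 hours"

# Compact copy of the sample dataset above (dates and _ids omitted for brevity).
DOCS = [
    {"asset": "foo", "recorded_ts": 1599782400, "state": "on"},
    {"asset": "foo", "recorded_ts": 1599789600, "state": "off"},
    {"asset": "foo", "recorded_ts": 1599796800, "state": "on"},
    {"asset": "baz", "recorded_ts": 1599813000, "state": "off"},
    {"asset": "foo", "recorded_ts": 1599834600, "state": "off"},
    {"asset": "foo", "recorded_ts": 1599854400, "state": "booting"},
    {"asset": "foo", "recorded_ts": 1599867000, "state": "off"},
]

def count_anomalies(docs, asset, start_ts, end_ts):
    """Count documents that left the "on" state more than THRESHOLD_SECONDS
    after the preceding record for `asset` within [start_ts, end_ts)."""
    # Partition by asset, restrict to the time range, and order by
    # recorded_ts -- the "windowing" step.
    rows = sorted(
        (d for d in docs
         if d["asset"] == asset and start_ts <= d["recorded_ts"] < end_ts),
        key=lambda d: d["recorded_ts"],
    )
    anomalies = 0
    for prev, curr in zip(rows, rows[1:]):
        if (curr["state"] != "on"           # criterion 1
                and prev["state"] == "on"   # preceding state was "on"
                and curr["recorded_ts"] - prev["recorded_ts"] > THRESHOLD_SECONDS):
            anomalies += 1
    return anomalies
```

Against the sample dataset this reports the single anomaly for `foo` (document 5, which arrives 10.5 hours after document 3).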
It feels like a "windowing operation" to me, partitioned by `asset` and ordered by `recorded_ts`. However, I'm not sure if there is a way to achieve this with the standard query DSL. (Perhaps it can be solved with a histogram and aggs?) Reading the docs, I'm not able to wrap my mind around how to compose the query.
Apparently, I'm a bit "temporally-challenged," so hopefully someone in the community who enjoys puzzles can help shed some light?
Thank you!