I'm trying to determine if there is a way that I can query an Elasticsearch index for a particular data anomaly we sometimes encounter.
Here is a contrived example of the dataset (with some annotations):
[
  {
    "_id": 1,
    "asset": "foo",
    "recorded_date": "2020-09-11T00:00:00.000Z", # start the "clock"
    "recorded_ts": 1599782400,
    "state": "on"
  },
  {
    "_id": 2,
    "asset": "foo",
    "recorded_date": "2020-09-11T02:00:00.000Z", # 2 hours <= 3 hours (ignored)
    "recorded_ts": 1599789600,
    "state": "off"
  },
  {
    "_id": 3,
    "asset": "foo",
    "recorded_date": "2020-09-11T04:00:00.000Z", # start the "clock"
    "recorded_ts": 1599796800,
    "state": "on"
  },
  {
    "_id": 4,
    "asset": "baz",
    "recorded_date": "2020-09-11T08:30:00.000Z", # baz != foo
    "recorded_ts": 1599813000,
    "state": "off"
  },
  {
    "_id": 5,
    "asset": "foo",
    "recorded_date": "2020-09-11T14:30:00.000Z", # 10.5 hours > 3 hours (anomaly detected!)
    "recorded_ts": 1599834600,
    "state": "off"
  },
  {
    "_id": 6,
    "asset": "foo",
    "recorded_date": "2020-09-11T20:00:00.000Z", # immediately-preceding foo state != on (ignored)
    "recorded_ts": 1599854400,
    "state": "booting"
  },
  {
    "_id": 7,
    "asset": "foo",
    "recorded_date": "2020-09-11T23:30:00.000Z", # immediately-preceding foo state != on (ignored)
    "recorded_ts": 1599867000,
    "state": "off"
  }
]
In a nutshell, I want to count how frequently, during a specific time range (e.g. 2020-09-11T00:00:00Z to 2020-09-12T00:00:00Z), a particular asset (e.g. foo) remains in a state equal to on for more than 3 hours.
(Imagine foo as a widget that could potentially overheat if it is on longer than 3 hours.)
Given the aforementioned sample dataset, the query would detect 1 anomaly for asset foo.
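To make the arithmetic behind that one anomaly concrete, here is a quick check of the gap between documents _id 3 ("on") and _id 5 (the next foo document, "off"), using the recorded_ts values from the sample:

```python
from datetime import datetime, timezone

# Epoch seconds from _id 3 ("on") and _id 5 ("off") in the sample dataset.
on_ts, off_ts = 1599796800, 1599834600

gap_hours = (off_ts - on_ts) / 3600
print(gap_hours)  # 10.5, which exceeds the 3-hour threshold

# Sanity-check that recorded_ts agrees with recorded_date.
print(datetime.fromtimestamp(on_ts, tz=timezone.utc).isoformat())
# 2020-09-11T04:00:00+00:00
```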
In thinking about the problem space I've come up with some criteria in an attempt to help describe what constitutes an anomaly:
- The current document must not have a `state` equal to `on`
- The current document must have a sequentially preceding document within the same `recorded_ts` period, with the same `asset` value
  a. The preceding document and the current document could potentially be non-consecutive in order (i.e. records for other assets interleaved in the dataset)
  b. Any preceding document outside of the `recorded_ts` period bounds should simply be ignored, and not count towards any anomaly counters
- The time difference between the `recorded_ts` fields of the current document and the preceding document must equate to more than `3` hours
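For reference, here is what those criteria look like implemented client-side in Python (not a solution to the DSL question, just a sketch to pin down the expected semantics; the function name and parameters are my own):

```python
def count_anomalies(events, asset, start_ts, end_ts, max_on_seconds=3 * 3600):
    """Count state transitions away from 'on' that happened > max_on_seconds
    after the immediately-preceding document for the same asset."""
    # Partition by asset and restrict to the queried recorded_ts period,
    # then order by recorded_ts (criterion 2, including 2a and 2b).
    window = sorted(
        (e for e in events
         if e["asset"] == asset and start_ts <= e["recorded_ts"] < end_ts),
        key=lambda e: e["recorded_ts"],
    )
    count = 0
    for prev, cur in zip(window, window[1:]):
        if (cur["state"] != "on"            # criterion 1
                and prev["state"] == "on"   # the "clock" was started
                and cur["recorded_ts"] - prev["recorded_ts"] > max_on_seconds):  # criterion 3
            count += 1
    return count

# The sample dataset, reduced to the fields the criteria use.
events = [
    {"asset": "foo", "recorded_ts": 1599782400, "state": "on"},
    {"asset": "foo", "recorded_ts": 1599789600, "state": "off"},
    {"asset": "foo", "recorded_ts": 1599796800, "state": "on"},
    {"asset": "baz", "recorded_ts": 1599813000, "state": "off"},
    {"asset": "foo", "recorded_ts": 1599834600, "state": "off"},
    {"asset": "foo", "recorded_ts": 1599854400, "state": "booting"},
    {"asset": "foo", "recorded_ts": 1599867000, "state": "off"},
]

print(count_anomalies(events, "foo", 1599782400, 1599868800))  # → 1
```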
It feels like a "windowing operation" to me, partitioned by asset and ordered by recorded_ts. However, I'm not sure whether there is a way to achieve this with the standard query DSL. (Perhaps it can be solved with a histogram and aggs?) Reading the docs, I haven't been able to wrap my mind around how to compose the query.
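One avenue that might avoid client-side post-processing is a scripted_metric aggregation, which lets the reduce phase see all collected events and apply the pairwise window logic itself. The sketch below is untested and makes several assumptions: an index named assets, state mapped as keyword, and recorded_ts as a numeric (epoch seconds) field; the triple-quoted script is Kibana Dev Tools syntax, not strict JSON.

```json
POST /assets/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "asset": "foo" } },
        { "range": { "recorded_ts": { "gte": 1599782400, "lt": 1599868800 } } }
      ]
    }
  },
  "aggs": {
    "anomalies": {
      "scripted_metric": {
        "init_script": "state.events = []",
        "map_script": "state.events.add(['ts': doc['recorded_ts'].value, 'st': doc['state'].value])",
        "combine_script": "return state.events",
        "reduce_script": """
          def all = [];
          for (s in states) { all.addAll(s) }
          all.sort((a, b) -> Long.compare(a.ts, b.ts));
          int n = 0;
          for (int i = 1; i < all.size(); i++) {
            if (all[i].st != 'on' && all[i-1].st == 'on'
                && all[i].ts - all[i-1].ts > 3 * 3600) { n++ }
          }
          return n;
        """
      }
    }
  }
}
```

Note that scripted_metric pulls every matching event into memory on the coordinating node, so this only scales to windows with a modest number of documents per asset.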
Apparently, I'm a bit "temporally-challenged", so hopefully someone in the community who enjoys puzzles can help shed some light?
Thank you!
