I tried looking for some documentation about time unit limitations but found nothing. I really need this kind of precision because I have 2 million data points scattered across a 10-minute range.
Is there a way to set this bucket span or is that blocked for a reason?
We've seen requirements for sub-second bucket_span resolution a few times in the past. We'd be curious to hear your use case so we can determine whether sub-second support is really necessary or not.
Thanks for the quick reply. We need this millisecond/microsecond precision for a CAN network used in an automotive machine-learning, alerting, and dashboard solution. Right now we rescale our data from milliseconds to seconds with a Logstash pipeline that reindexes it, turning a range of minutes into a range of about 10 days.
However, this solution doesn't allow us to run a real-time job; it only lets us analyze data from the past.
I hope you will understand our need.
I guess I don't fully understand why a 1s time resolution isn't good enough for the "alerting and dashboarding" goal you've stated.
Your data can still be indexed with sub-second resolution.
All of your data would get bucketed to a 1s interval for ML analysis.
ML would detect anomalies on that timescale (of 1s).
You could alert on anomalies within seconds of occurrence, even in near real-time.
In other words, if you say that "alerting and dashboarding" is the main goal here, then an alert issued at 12:00:01 versus 12:00:02 (or even 12:00:10 for that matter) would be barely different from a human response perspective. People don't react at sub-second intervals, right?
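For example, a job along these lines would bucket millisecond-resolution documents into 1s intervals. This is just a minimal sketch, not your configuration: the cluster URL, credentials, job id, index, and field names are placeholders.

```python
# Minimal sketch: an anomaly detection job with the smallest supported
# bucket_span of 1s. Raw documents can still carry epoch-millisecond
# timestamps; ML simply aggregates them into 1s buckets.
# (URL, credentials, job id, and field names are assumptions.)
import requests

job = {
    "analysis_config": {
        "bucket_span": "1s",  # current minimum resolution
        "detectors": [{"function": "mean", "field_name": "sensor_value"}],
    },
    "data_description": {
        "time_field": "@timestamp",
        "time_format": "epoch_ms",  # sub-second timestamps are fine here
    },
}

# Create the job through the ML API
resp = requests.put(
    "http://localhost:9200/_ml/anomaly_detectors/can-bus-1s",
    json=job,
    auth=("elastic", "changeme"),
)
resp.raise_for_status()
print(resp.json())
```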
Yes, you are right; I was a little too generic when explaining my goals. My goal is to bring my client responsive, real-time dashboards, a useful alerting system, and machine learning anomaly analysis. The first and the second are not a problem, since you can easily set the date histogram interval to milliseconds, and you never need an alert every 10 milliseconds - that would be insane. The fact that you can set the date histogram scale to milliseconds is the main reason I thought it would be possible in machine learning jobs too, to be honest. Your third bullet point is the only actual problem: when I rescaled the 10 minutes of data onto a 10-day range and set the minimum 1 second bucket, I got two or three anomalies hidden within that original 10-minute range.
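For example, a search-side aggregation like the following already works at millisecond granularity for dashboards. This is just a rough sketch: the index and field names are made up, and on older Elasticsearch versions the parameter is `interval` rather than `fixed_interval`.

```python
# Sketch of a date_histogram aggregation at millisecond granularity.
# Index name, field names, URL, and credentials are assumptions.
import requests

query = {
    "size": 0,
    "aggs": {
        "per_10ms": {
            "date_histogram": {
                "field": "@timestamp",
                "fixed_interval": "10ms",  # dashboards can go down to milliseconds
            },
            "aggs": {"avg_value": {"avg": {"field": "sensor_value"}}},
        }
    },
}

resp = requests.post(
    "http://localhost:9200/can-bus-data/_search",
    json=query,
    auth=("elastic", "changeme"),
)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_10ms"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_value"]["value"])
```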
I understand it would be unwise to set a millisecond bucket size on a year's worth of data, but what if I want to analyze anomalies in a minute's worth of data from a motor that sends a new data point every millisecond?
I know Elasticsearch is mainly about "search", as the name states, but I think many people working with microcontrollers, fast-paced networks, motors, and IoT sensors might find a helping hand in Elasticsearch's brand-new machine learning features.
Let me pose this question to come at your needs a little differently:
What is the minimum duration of an anomaly that you would care about? In other words, if a sensor value was 10x higher than "normal" for a duration of 1ms - would you (or anyone) care?
This is the same argument as monitoring anomalous CPU utilization on a server. If the CPU spikes for 1 second and goes back to normal, in most cases no one cares. However, if it spikes for a minute or two, then there may genuinely be something worth looking into. As such, analyzing CPU for anomalies at a 1-second interval is silly, despite the fact that the data truly is that granular.
Let's say I have a sensor measuring pressure that may trigger the airbag deployment in a car - just an example, not my specific case, luckily for me. That sensor sends a million data points every second, for the sake of the driver. If, during a three-second braking event, the sensor sent a higher value than usual for 6 microseconds, and the airbag should have deployed after 5 microseconds, wouldn't you care, or at least want to know about it?
I could be snarky here and say that if I'm the driver of the car, then I don't care whether the airbag deploys in 5us or 6us, I only care that I don't die. If I'm the designer of the car, then I might care about these differences. However, the designer of the car is NOT analyzing this data in real-time to determine whether or not to deploy the airbag; he/she is analyzing the data after the fact, as part of the design and/or testing phase.
So, I still don't fully understand your use case here. There is no possible way that one can get data from sensors centrally collected, ingested, indexed, and made searchable and analyzable at sub-second response time - by any toolset that exists, to my knowledge. Of course, with the Elastic Stack, you can get data that is this granular, and index it and analyze it - it will just be at a "real-time" rate that is measured in seconds, at least. It's not like you're planning on embedding Elasticsearch into the sensor/data stream of the car and expecting Elastic ML to determine whether or not to fire the airbag, are you?
Let me write down what I have to do in order to analyze my data correctly:
I collect 600,000 data points in a 10-minute range (one every millisecond).
I take the first epoch-millisecond timestamp and reindex the data: I calculate each data point's offset from that first timestamp and multiply it by a factor of 1,000 to rescale the data, making every millisecond "look" like a second (see the sketch after this list).
Now that I have 600,000 data points spread across 10,000 minutes, I can set the bucket_span to 1 second and run the job on the full reindexed data to find anomalies.
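The rescale step - which we normally do in our Logstash pipeline - looks roughly like this Python sketch. Index names, field names, and connection details are placeholders, not our real setup.

```python
# Sketch of the time-rescaling workaround: stretch epoch-millisecond offsets by
# a factor of 1000 so that every millisecond "looks" like a second to ML.
# Index names, field names, and connection details are assumptions; the real
# pipeline is implemented in Logstash rather than Python.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
SCALE = 1000  # 1 ms of real time becomes 1 s of "stretched" time

def rescaled_docs(source_index, target_index):
    first_ts = None
    # Stream documents in timestamp order (timestamp_ms holds epoch milliseconds)
    for hit in helpers.scan(
        es,
        index=source_index,
        query={"sort": ["timestamp_ms"]},
        preserve_order=True,
    ):
        src = hit["_source"]
        ts = src["timestamp_ms"]
        if first_ts is None:
            first_ts = ts
        stretched = first_ts + (ts - first_ts) * SCALE
        yield {
            "_index": target_index,
            "_source": {**src, "timestamp_ms": stretched},
        }

# Bulk-write the stretched copies into a new index, then point the ML job at it
helpers.bulk(es, rescaled_docs("can-bus-raw", "can-bus-rescaled"))
```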
Here’s what I would like to do:
I collect 600,000 data points in a 10-minute range (one every millisecond) and set a 1ms bucket_span.
Let's forget for a moment about alerts, dashboards, and real-time jobs; I would just like to skip the "rescale" step. I think that whether I have 600,000 or even 600,000,000 data points, I should be able to detect anomalies and find patterns regardless of whether the data was collected in a minute or in a year. Isn't it just the number of data points that counts when detecting anomalies? Does it really matter whether the data "happened" over a year or over a minute? And if it happened in a minute, are 60 one-second buckets enough to detect anomalies?
Yes, I understand the logistical challenges of working around ML's 1s minimum limit for bucket_span, and I have created an enhancement request for the development team that references this forum entry. Honestly, though, without a complete understanding of the use case (i.e. something that we fully understand and can assess as truly applicable to many of our users, rather than merely an inconvenience for you), I'm not sure how aggressively this will be prioritized.
For the time being, it would seem that your workaround is good enough to get you by with your offline analysis. The product, used in the way you describe, should be able to detect anomalies in your time-scaled data.
Well, at this point I just have to wait and hope that other people come up with similarly fine-grained analysis use cases that will raise the priority of this feature.
Thank you very much for spending all this time assessing my issue and helping me; I'll accept your last reply as the answer.