ML anomaly detection for log-volume drop detection: looking for real-world experiences (low_count, high_count)

battal · June 25, 2026, 6:51am

Hello dear Elastic Community,

I am new here. I'm building out log-source health monitoring across a multi-tenant ECE deployment and would love to hear how others have approached this with ML anomaly detection, especially around the tricky edge cases.

What I'm trying to solve

We already run an aggressive "log source is dead" rule (an Elasticsearch query rule that fires when a source drops below a fixed count over N minutes). That works as a terminal alarm, but it only catches a source once it's basically silent. What I actually want is the early indicator: catching a source that's collapsing (e.g. from ~5M events/day down to a fraction of that) well before it hits the hard floor.

Current setup

Anomaly detection job on firewall log volume
Function: low_count (I only care about drops as the early warning; spikes are handled separately)
partition_field_name = observer.name → one independent baseline per firewall, so a single device going quiet stands out even while the others run normally
bucket_span: 15m for the high-volume sources
Two-tier alerting on the same job: a "minor" rule at anomaly score ~25 (info channel) and a "major" rule at 75+ (escalation), result type set to Record so it pins the individual source rather than aggregating the whole job

Where I'm unsure / what I'd like to hear about

True-zero behaviour. My understanding is that low_count struggles when a partition value goes to actual zero: no documents means no clean buckets to model the transition, so the source disappears rather than producing a strong low anomaly. Is that your experience too? How are you bridging the gap between "abnormal drop" (ML) and "complete silence" (fixed threshold)? Are you just running both side by side like I am, or is there a cleaner pattern?
bucket_span tuning for mixed cadence. I have high-volume sources (15m feels right) alongside sparse/bursty ones where 15m flags every quiet period as anomalous. Should I split these into two jobs by cadence, or is there a better approach (sparse data option, summary_count, etc.)?
Multi-value identifier fields. Some of my sources are only distinguishable via a tags array (multi-value), which isn't suitable as a partition field. I'm leaning toward deriving a single-value keyword in an ingest pipeline. Has anyone solved source identification differently?
Learning period in practice. How long did your models realistically need before the daily/weekly seasonality settled and the false-positive rate became manageable? I'm planning to run only the major (75) tier during the warm-up.

Any war stories, gotchas, or "I wish I'd known this earlier" notes are very welcome. Happy to share back what works on our side as it matures.

Best Regards

Rafa_Silva · July 7, 2026, 12:18am

Hi Battal, welcome to the Elastic community!

This is a great first question, and it is a very real-world use case. Based on the information shared so far, I think your current design is directionally correct, but I would avoid trying to make ML the only signal for this type of log-source health monitoring.

For this kind of scenario, I usually separate the problem into two different signals:

Degraded volume / abnormal drop

This is where low_count is a good fit. Elastic documents low_count as the count function used to detect when the number of events in a bucket is unusually low.

Complete silence / source is dead

I would keep this as a deterministic no-data, last-seen, or fixed-threshold rule. In my opinion, this is not just a fallback; it is the right control for the “true zero” case. ML is very useful for detecting that a source is behaving abnormally before it reaches zero, but the terminal condition where nothing is arriving anymore is usually better handled by a rule that explicitly checks last-seen time or count over a fixed window.

Your use of partition_field_name = observer.name also makes sense if each firewall should have its own independent baseline. The create anomaly detection job API describes partition_field_name as segmenting the analysis with completely independent baselines for each value.
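A minimal sketch of what the request body for such a job might look like, built as a Python dictionary. The function, partition field, and bucket span come from your post; the job ID, description, time field, and influencer list are assumptions for illustration only.

```python
import json

# Hypothetical job ID, not taken from the original post.
JOB_ID = "firewall-low-volume"


def build_job_body(bucket_span: str = "15m") -> dict:
    """Sketch of a body for PUT _ml/anomaly_detectors/<job_id>."""
    return {
        "description": "Early warning for firewall log-volume drops",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {
                    # Only unusually low counts are flagged.
                    "function": "low_count",
                    # One independent baseline per firewall.
                    "partition_field_name": "observer.name",
                }
            ],
            # Assumption: surface the firewall name as an influencer too.
            "influencers": ["observer.name"],
        },
        # Assumption: events carry the usual @timestamp field.
        "data_description": {"time_field": "@timestamp"},
    }


if __name__ == "__main__":
    print(json.dumps(build_job_body(), indent=2))
```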

The main point I would validate is whether observer.name is globally unique in your multi-tenant ECE environment. If the same firewall name can exist in different tenants or deployments, I would derive a canonical single-value field, for example:

source_health.id = tenant_id + "/" + observer.type + "/" + observer.name

Then I would use that field as the partition field. This avoids baseline collisions across tenants.
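The logic of that derivation could be sketched as below. This is only an illustration in Python; in practice the field would be built at ingest time. The function name is hypothetical, and trimming and lowercasing the parts is my assumption, not something Elastic requires.

```python
def canonical_source_id(tenant_id: str, observer_type: str, observer_name: str) -> str:
    """Build a single-value ID of the form tenant/type/name.

    Assumption: parts are trimmed and lowercased so that "FW-01" and
    "fw-01 " from the same tenant share one baseline.
    """
    parts = [tenant_id, observer_type, observer_name]
    cleaned = [part.strip().lower() for part in parts]
    if any(not part for part in cleaned):
        # Refuse to build a partial ID; a missing part would merge sources.
        raise ValueError("tenant_id, observer_type and observer_name are all required")
    return "/".join(cleaned)


if __name__ == "__main__":
    # The same firewall name in two tenants yields two distinct IDs.
    print(canonical_source_id("tenant-a", "firewall", "fw-01"))
    print(canonical_source_id("tenant-b", "firewall", "fw-01"))
```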

For the true-zero behavior, the cleanest pattern I have seen is layered alerting rather than choosing only one approach:

ML low_count for early degradation.
Deterministic no-data / last-seen alert for complete silence.
Optional source inventory or heartbeat-style index if you need to track expected sources even when no raw logs arrive.
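The deterministic layer can be sketched as a plain last-seen comparison against an inventory of expected sources. This is a simplified illustration, not an Elastic rule definition: the function name, the inventory set, and the last-seen mapping are all hypothetical inputs that you would populate from your own queries.

```python
from datetime import datetime, timedelta, timezone


def find_silent_sources(expected, last_seen, now, max_silence):
    """Return expected sources that are missing or silent for too long.

    expected:    set of source IDs that should be sending logs
    last_seen:   dict of source ID -> datetime of the newest event
    now:         current time (timezone-aware)
    max_silence: timedelta after which a source counts as dead
    """
    silent = []
    for source in sorted(expected):
        seen = last_seen.get(source)
        # A source with no events at all is silent by definition.
        if seen is None or now - seen > max_silence:
            silent.append(source)
    return silent


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    seen = {"fw-a": now - timedelta(minutes=5), "fw-b": now - timedelta(hours=2)}
    print(find_silent_sources({"fw-a", "fw-b", "fw-c"}, seen, now, timedelta(minutes=30)))
```

Because the inventory drives the loop, a source that never sent anything (fw-c above) is still reported, which is the case ML cannot see.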

This is also safer because ingest delay can create false positives with low_count. Elastic specifically notes that consistently delayed data can affect low_count jobs and may lead to false positives.

For mixed cadence sources, I would split jobs by behavior rather than force one bucket_span to work for everything. For example:

High-volume firewalls: 15m bucket span may be reasonable.
Sparse or bursty sources: use a larger bucket span, or keep them in a separate job with different alert thresholds.
Sources where quiet periods are normal: consider whether they should be monitored with non_zero_count, but only if zero buckets are not operationally important. Elastic documents that non_zero_count ignores buckets where the count is zero, so I would not use it for sources where silence is itself a failure condition.
Count functions | Elastic Docs
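One possible way to decide which job a source belongs in is to look at how often its buckets are empty during a known-healthy period. The sketch below is an assumption on my part, not an Elastic feature: the function name, the group labels, and the 5% cutoff are hypothetical and would need tuning against your own data.

```python
def assign_job(bucket_counts, max_empty_fraction=0.05):
    """Pick a job group from per-bucket event counts of a healthy period.

    bucket_counts:      list of event counts, one per bucket (e.g. per 15m)
    max_empty_fraction: hypothetical cutoff; above it the source is treated
                        as sparse and kept out of the high-volume job
    """
    if not bucket_counts:
        # No history at all: do not trust a tight baseline.
        return "sparse"
    empty = sum(1 for count in bucket_counts if count == 0)
    if empty / len(bucket_counts) <= max_empty_fraction:
        return "high_volume"
    return "sparse"


if __name__ == "__main__":
    print(assign_job([5200, 4900, 5100, 5300]))  # steady firewall
    print(assign_job([0, 0, 40, 0, 0, 12]))      # bursty source
```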

If you use datafeed aggregations for performance, remember that summary_count_field_name is intended for pre-aggregated input data. It is not really a general fix for sparse source behavior. Elastic also recommends that when count or rare functions use aggregations, the aggregation interval should match the bucket span.
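A hedged sketch of a datafeed body whose date_histogram interval matches the job's bucket span is shown below. The index pattern and the terms size are hypothetical, and the job itself would also need summary_count_field_name set to doc_count in its analysis_config for this to work; please check the shape against the datafeed aggregation docs for your version.

```python
def build_datafeed_body(job_id: str, index_pattern: str, bucket_span: str = "15m") -> dict:
    """Sketch of a body for PUT _ml/datafeeds/<feed_id> using aggregations."""
    return {
        "job_id": job_id,
        "indices": [index_pattern],
        "aggregations": {
            "buckets": {
                "date_histogram": {
                    "field": "@timestamp",
                    # Keep the histogram interval equal to the bucket span.
                    "fixed_interval": bucket_span,
                },
                "aggregations": {
                    # Max of the time field, named after the time field.
                    "@timestamp": {"max": {"field": "@timestamp"}},
                    # One sub-bucket per partition value; size is an assumption.
                    "observer.name": {
                        "terms": {"field": "observer.name", "size": 1000}
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    import json

    # Hypothetical job ID and index pattern.
    print(json.dumps(build_datafeed_body("firewall-low-volume", "logs-firewall-*"), indent=2))
```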

For the multi-value identifier problem, I would also lean toward deriving a single-value keyword during ingest. Ingest pipelines are designed for transformations before indexing, so creating a stable field such as source_health.id would be a clean approach.

For the learning period, I would treat the first phase as calibration rather than production-quality alerting. There is no universal number because it depends on the bucket span, source cadence, and seasonality. In practice, I would want at least one full weekly cycle before trusting low-severity alerts, and often two to four weeks before the false-positive rate becomes comfortable for operational use.

Your idea of enabling only the major tier during warm-up sounds reasonable to me. I would probably start with only the 75+ rule, review the records manually, and then enable the lower score / info channel after the model has seen enough normal daily and weekly behavior.
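The warm-up behaviour can be summarised as a small routing function. This only illustrates the decision logic; in Kibana the two tiers would be two separate rules with their own severity thresholds. The thresholds 25 and 75 are the ones from your post, while the function and channel names are hypothetical.

```python
from typing import Optional

MINOR_SCORE = 25  # info channel threshold from the original post
MAJOR_SCORE = 75  # escalation threshold from the original post


def route_alert(record_score: float, warm_up: bool) -> Optional[str]:
    """Return the channel for a record score, or None if nothing fires.

    During warm-up only the major tier is active; the minor tier is
    switched on once the model has seen enough normal behaviour.
    """
    if record_score >= MAJOR_SCORE:
        return "escalation"
    if record_score >= MINOR_SCORE and not warm_up:
        return "info"
    return None


if __name__ == "__main__":
    for score in (10, 40, 80):
        print(score, route_alert(score, warm_up=True), route_alert(score, warm_up=False))
```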

One more operational detail: using Record-level alerting is the right direction when you want the alert to point to the affected source rather than only the overall job. Kibana ML alerting supports rules based on bucket, record, or influencer results.

So, in short, I would not replace your fixed “source is dead” rule. I would keep both:

ML for “this source is degrading compared to its own normal behavior”.
Deterministic rule for “this expected source has stopped sending data”.

That separation usually makes the alerting model easier to explain, easier to tune, and safer during ingest delays or sparse-source behavior.

battal · July 7, 2026, 1:30pm

Hello Rafa,

Thank you so much for the detailed reply. This is incredibly helpful. I'll take my time to read through it properly and work through the points you raised, especially the two-signal separation and the canonical source_health.id idea for the multi-tenant setup.

Really appreciate you taking the time to lay this out. I'll come back with follow-up questions once I've digested it.

Best Regards
