ML anomaly detection for log-volume drops: looking for real-world experiences (low_count, high_count)

Hello dear Elastic Community,

I am new here. I'm building out log-source health monitoring across a multi-tenant ECE deployment and would love to hear how others have approached this with ML anomaly detection, especially around the tricky edge cases.

What I'm trying to solve

We already run an aggressive "log source is dead" rule (an Elasticsearch query rule that fires when a source drops below a fixed count over N minutes). That works as a terminal alarm, but it only catches a source once it's basically silent. What I actually want is an early indicator that catches a source while it's collapsing (e.g. from ~5M events/day down to a fraction of that), well before it hits the hard floor.

Current setup

  • Anomaly detection job on firewall log volume
  • Function: low_count (I only care about drops as the early warning; spikes are handled separately)
  • partition_field_name = observer.name → one independent baseline per firewall, so a single device going quiet stands out even while the others run normally
  • bucket_span: 15m for the high-volume sources
  • Two-tier alerting on the same job: a "minor" rule at anomaly score ~25 (info channel) and a "major" rule at 75+ (escalation), result type set to Record so it pins the individual source rather than aggregating the whole job
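To make the setup above concrete, here is a rough sketch of the job body and the two score tiers, written as Python that just builds the JSON. The job ID, the "@timestamp" time field, and the influencer entry are my assumptions for illustration; the detector function, partition field, and bucket span are as listed above. The tier function only mirrors the thresholds of the two Kibana rules, it is not how the rules themselves are implemented.

```python
import json
from typing import Optional

# Hypothetical job ID, chosen for this example only.
JOB_ID = "firewall-log-volume-drop"

# Body for PUT _ml/anomaly_detectors/<job_id>.
job_config = {
    "description": "Early warning for collapsing firewall log volume",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {
                # Only drops matter here; spikes are handled separately.
                "function": "low_count",
                # One independent baseline per firewall.
                "partition_field_name": "observer.name",
            }
        ],
        # Assumption: using the partition field as an influencer as well.
        "influencers": ["observer.name"],
    },
    # Assumption: the logs use the usual @timestamp field.
    "data_description": {"time_field": "@timestamp"},
}

MINOR_THRESHOLD = 25  # info channel
MAJOR_THRESHOLD = 75  # escalation


def alert_tier(record_score: float) -> Optional[str]:
    """Map a record-level anomaly score to the alert tier it would trigger."""
    if record_score >= MAJOR_THRESHOLD:
        return "major"
    if record_score >= MINOR_THRESHOLD:
        return "minor"
    return None


print(f"PUT _ml/anomaly_detectors/{JOB_ID}")
print(json.dumps(job_config, indent=2))
```

For example, alert_tier(80) gives "major", alert_tier(30) gives "minor", and anything below 25 gives None.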

Where I'm unsure / what I'd like to hear about

  1. True-zero behaviour. My understanding is that low_count struggles when a partition value goes to actual zero: no documents means no clean buckets to model the transition, so the source disappears rather than producing a strong low anomaly. Is that your experience too? How are you bridging the gap between "abnormal drop" (ML) and "complete silence" (fixed threshold)? Are you just running both side by side like I am, or is there a cleaner pattern?
  2. bucket_span tuning for mixed cadence. I have high-volume sources (15m feels right) alongside sparse/bursty ones where 15m flags every quiet period as anomalous. Is splitting into two jobs by cadence the way to go, or is there a better approach (sparse data option, summary_count, etc.)?
  3. Multi-value identifier fields. Some of my sources are only distinguishable via a tags array (multi-value), which isn't suitable as a partition field. I'm leaning toward deriving a single-value keyword in an ingest pipeline. Has anyone solved source identification differently?
  4. Learning period in practice. How long did your models realistically need before the daily/weekly seasonality settled and the false-positive rate became manageable? I'm planning to run only the major (75) tier during the warm-up.
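For point 3, this is roughly the derivation logic I have in mind, sketched in Python only to show the idea; the real thing would live in an ingest pipeline. The "source:" tag prefix and the log_source_id field name are hypothetical, made up for this example, and our actual tags may need a different matching rule.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical convention: one tag per document looks like "source:<name>".
SOURCE_TAG_PREFIX = "source:"
# Hypothetical target field that would become the partition field.
TARGET_FIELD = "log_source_id"


def derive_source_id(tags: Iterable[str],
                     prefix: str = SOURCE_TAG_PREFIX) -> Optional[str]:
    """Return a single source name taken from a multi-value tags array.

    Candidates are sorted so the result does not depend on array order.
    """
    candidates = sorted(
        tag[len(prefix):]
        for tag in tags
        if tag.startswith(prefix) and len(tag) > len(prefix)
    )
    return candidates[0] if candidates else None


def enrich(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of doc with the single-value keyword added, if derivable."""
    tags = doc.get("tags", [])
    if isinstance(tags, str):  # a one-element array may come back as a bare string
        tags = [tags]
    enriched = dict(doc)
    source_id = derive_source_id(tags)
    if source_id is not None:
        enriched[TARGET_FIELD] = source_id
    return enriched


example = enrich({"tags": ["prod", "source:fw-edge-01"], "message": "..."})
print(example[TARGET_FIELD])
```

Documents without a matching tag are left without the field, so they would simply not get a partition rather than being lumped into a wrong one.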

Any war stories, gotchas, or "I wish I'd known this earlier" notes are very welcome. Happy to share back what works on our side as it matures.

Best Regards