ML job has no warnings or errors, but it seems to be lagging

In one of our ML jobs, there do not appear to be any errors or warnings in the job messages.

However, the latest_record_timestamp is lagging behind the current timestamp.

Kindly advise how we can investigate this further.


### Counts

job_id				pred_maint-ABCDEF-deny-high-count
processed_record_count		46,880,506,574
processed_field_count		348,373,955,401
input_bytes				18.4 TB
input_field_count			348,373,955,401
invalid_date_count		0
missing_field_count		26,670,097,191
out_of_order_timestamp_count	0
empty_bucket_count		2
sparse_bucket_count		2
bucket_count			3,275
earliest_record_timestamp	2026-02-11 11:21:59
latest_record_timestamp		2026-03-17 11:14:06
last_data_time			2026-03-20 16:43:29
latest_empty_bucket_timestamp	2026-02-11 12:00:00
latest_sparse_bucket_timestamp	2026-03-16 03:45:00
input_record_count		46,880,506,574
log_time				2026-03-20 16:43:29
latest_bucket_timestamp		2026-03-17 09:45:00


### Model size stats

job_id				pred_maint-ABCDEF-deny-high-count
result_type				model_size_stats
model_bytes				124.1 MB
peak_model_bytes			130.8 MB
model_bytes_exceeded		0.0 B
model_bytes_memory_limit	512.0 MB
total_by_field_count		103
total_over_field_count		0
total_partition_field_count	102
bucket_allocation_failures_count	0
memory_status			ok
assignment_memory_basis		current_model_bytes
output_memory_allocator_bytes	29363
categorized_doc_count		0
total_category_count		0
frequent_category_count		0
rare_category_count		0
dead_category_count		0
failed_category_count		0
categorization_status		ok
log_time				2026-03-20 13:45:30
timestamp				2026-03-17 10:00:00


### Job timing stats

job_id							pred_maint-ABCDEF-deny-high-count
bucket_count						2,783
total_bucket_processing_time_ms			720,306
minimum_bucket_processing_time_ms			20
maximum_bucket_processing_time_ms			1,820
average_bucket_processing_time_ms			258.824
exponential_average_bucket_processing_time_ms	294.209
exponential_average_bucket_processing_time_per_hour_ms	824.883

The gap between latest_record_timestamp (2026-03-17) and last_data_time (2026-03-20) tells you the datafeed is still pushing data, but the ML job is falling behind on processing it.

Looking at your stats, the likely bottleneck is throughput vs. volume. Your job has processed ~46.8 billion records with an average bucket processing time of ~259 ms. Across 3,275 buckets, that is a lot of data per bucket. A couple of things to check:

1. Check the datafeed search latency:

GET _ml/datafeeds/datafeed-pred_maint-ABCDEF-deny-high-count/_stats

Look at the timing_stats section of the response (search_count, total_search_time_ms, average_search_time_per_bucket_ms). If the average search time per bucket is high (several seconds), the datafeed is struggling to pull data fast enough.
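For reference, the timing section of that stats response looks roughly like this (an abridged sketch; the numbers are invented, and the field names are from recent Elasticsearch versions, so verify them against your own cluster's output):

{
  "count": 1,
  "datafeeds": [
    {
      "datafeed_id": "datafeed-pred_maint-ABCDEF-deny-high-count",
      "state": "started",
      "timing_stats": {
        "job_id": "pred_maint-ABCDEF-deny-high-count",
        "search_count": 51234,
        "total_search_time_ms": 9876543,
        "average_search_time_per_bucket_ms": 3015.4,
        "exponential_average_search_time_per_hour_ms": 12000.0
      }
    }
  ]
}

If average_search_time_per_bucket_ms is a large fraction of (or exceeds) your bucket span, the searches themselves are the bottleneck.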

2. Check if the query_delay is too conservative:

The default query_delay is usually 60s, but if your source indices have high ingestion latency, you might need to increase it. Conversely, if you decreased it, the datafeed may be doing many small searches instead of fewer larger ones.
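If you do need to adjust it, query_delay can be changed with the datafeed update API (a sketch; the 120s value is purely illustrative, and the datafeed typically has to be stopped before updating and restarted afterwards):

POST _ml/datafeeds/datafeed-pred_maint-ABCDEF-deny-high-count/_update
{
  "query_delay": "120s"
}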

3. The missing_field_count (26.6 billion) is significant:

That's roughly 57% of your total records. This means over half of your input records are missing the analysis field. While this won't cause errors, the job still has to read and discard those records, which burns processing time. If you can tighten your datafeed query to filter out records that don't have the target field, you'll significantly reduce the processing backlog.

Try adding a filter to the datafeed query:

{
  "bool": {
    "must": [
      { "exists": { "field": "your_analysis_field" } }
    ]
  }
}

That alone could cut your datafeed volume nearly in half and help the job catch up.
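Applying the filter means stopping the datafeed, updating its query, and starting it again (a sketch; the datafeed id is taken from the stats call above, and your_analysis_field is a placeholder for your actual detector field):

POST _ml/datafeeds/datafeed-pred_maint-ABCDEF-deny-high-count/_stop

POST _ml/datafeeds/datafeed-pred_maint-ABCDEF-deny-high-count/_update
{
  "query": {
    "bool": {
      "must": [
        { "exists": { "field": "your_analysis_field" } }
      ]
    }
  }
}

POST _ml/datafeeds/datafeed-pred_maint-ABCDEF-deny-high-count/_start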

@taylorbrooks

Appreciate your time and assistance.

Kibana web >> Machine Learning >> Anomaly Detection >> Jobs >> Anomaly Detection Job

If I run a query

FROM .ml-anomalies*

I can see the following fields:

job_id
last_data_time
search_count
latest_record_timestamp

but how can I also get the following fields, for a complete picture:

query_delay
throughput
volume

and ideally compare whether last_data_time and latest_record_timestamp are too far apart.

POST /_query?format=txt
{
  "query": """
  FROM .ml-anomalies*
|   WHERE missing_field_count > 50
|   STATS 
      last_data = MAX(last_data_time),
      latest_record = MAX(latest_record_timestamp),
      search_count = MAX(search_count),
      missing_count = MAX(missing_field_count)
    BY job_id
|   SORT job_id DESC
|   LIMIT 100
    """
}


       last_data        |     latest_record      | search_count  | missing_count |                            job_id                            
------------------------+------------------------+---------------+---------------+--------------------------------------------------------------
2026-03-25T05:54:33.532Z|2026-03-25T05:52:29.994Z|null           |129340770      |ABCD                                   
2026-03-06T08:12:30.130Z|2026-02-10T04:33:17.041Z|null           |388379047410   |EFGH                              
2026-03-06T07:53:39.923Z|2025-12-18T15:19:28.000Z|null           |815588631      |JKLM   

Kindly advise.
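On the comparison part of the question above: query_delay is not stored in .ml-anomalies* at all; it lives in the datafeed configuration (GET _ml/datafeeds/<datafeed_id>), and throughput/volume have to be derived from the counts. The last_data_time vs latest_record_timestamp gap, however, can be computed directly in ES|QL (a sketch; the DATE_DIFF usage and the 60-minute threshold are assumptions to adjust):

POST /_query?format=txt
{
  "query": """
  FROM .ml-anomalies*
| STATS
    last_data = MAX(last_data_time),
    latest_record = MAX(latest_record_timestamp)
  BY job_id
| EVAL lag_minutes = DATE_DIFF("minutes", latest_record, last_data)
| WHERE lag_minutes > 60
| SORT lag_minutes DESC
| LIMIT 100
  """
}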

@taylorbrooks gives solid advice to @Whoami1980 about the efficiency of the datafeed, i.e., moving the raw historical data through the ML algorithms for processing.

The key question is whether your job will ever “catch up” to real time. It will eventually catch up if the time needed to process a bucket's worth of data is less than the bucket span: if it takes 1 minute to fetch and process 15 minutes' worth of data, you'll eventually catch up; if it takes 20 minutes, you'll just fall further and further behind over time.
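As a quick sanity check against the stats above (the bucket span isn't shown in this thread, so this is an assumption to verify): average_bucket_processing_time_ms ≈ 259 ms per bucket is tiny compared with any realistic span, which points at data fetching rather than model processing as the bottleneck. The bucket span is in the job config and the timing stats in the job stats endpoint:

GET _ml/anomaly_detectors/pred_maint-ABCDEF-deny-high-count

GET _ml/anomaly_detectors/pred_maint-ABCDEF-deny-high-count/_stats

Compare timing_stats.average_bucket_processing_time_ms from the second call to analysis_config.bucket_span from the first; the job catches up only while that ratio stays well below 1.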

There are mechanisms to help. As pointed out by @taylorbrooks - you can filter out the irrelevant data.

Alternatively, you can pre-aggregate your data before sending it to the ML node.
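For a count-style job, pre-aggregation usually means adding an aggregations section to the datafeed and setting summary_count_field_name to "doc_count" in the job's analysis_config, so the model consumes bucketed summaries instead of raw events. A sketch with an assumed time field (@timestamp); the histogram interval must divide evenly into the bucket span, and the time field needs a max sub-aggregation:

{
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      },
      "aggregations": {
        "@timestamp": {
          "max": { "field": "@timestamp" }
        }
      }
    }
  }
}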

@richcollier I am sure, and very happy, that @taylorbrooks is helping me.

However, I also did my part and tried to follow his advice.

As you can see here, search_count and search_time come back null,

and if you look at the data view there isn't a query or delay field.

POST /_query?format=txt
{
  "query": """
  FROM .ml-anomalies*
|   WHERE missing_field_count > 50
|   STATS 
      last_data = MAX(last_data_time),
      latest_record = MAX(latest_record_timestamp),
      search_count = MAX(search_count),
      missing_count = MAX(missing_field_count),
      search_time = MAX(total_search_time_ms)
    BY job_id
|   SORT missing_count DESC
|   LIMIT 100
    """
}


       last_data        |     latest_record      | search_count  | missing_count |  search_time  |                            job_id                            
------------------------+------------------------+---------------+---------------+---------------+--------------------------------------------------------------
2026-03-05T04:58:55.453Z|2026-03-04T17:09:30.155Z|null           |1873040552758  |null           |                 
2026-03-06T08:12:30.130Z|2026-02-10T04:33:17.041Z|null           |388379047410   |null           |                            
2026-03-06T08:12:31.059Z|2026-02-18T10:45:33.557Z|null           |298201165372   |null           |                                
2026-03-27T03:06:16.784Z|2026-03-27T02:52:29.962Z|null           |98097850101    |null           |                        
2026-01-31T07:43:17.441Z|2026-01-09T15:48:52.744Z|null           |84504242927    |null           |                               
2026-03-27T03:06:15.848Z|2026-03-27T02:52:29.962Z|null           |67448682708    |null           |          
2026-03-06T08:11:41.504Z|2025-10-31T16:09:56.175Z|null           |65857243220    |null           |                             
2026-03-06T08:12:19.693Z|2025-12-03T00:59:31.000Z|null           |43473003282    |null           |                           
2026-03-27T03:06:59.917Z|2026-03-26T22:32:07.184Z|null           |28311362868    |null           |                           
2026-03-06T08:12:21.880Z|2026-01-01T01:46:58.214Z|null           |22038507353    |null           |                            
2026-03-27T03:06:59.602Z|2026-03-25T17:39:49.086Z|null           |20382856432    |null           |                    
2026-03-27T03:06:58.652Z|2026-03-22T02:25:10.465Z|null           |18671372460    |null           |                         
2026-03-27T03:06:58.667Z|2026-03-27T02:42:42.055Z|null           |11420005024    |null           |                              
2026-02-27T00:49:24.348Z|2026-02-21T08:30:12.205Z|null           |8744129138     |null           |                      
2026-03-27T03:04:38.642Z|2026-03-27T02:52:22.201Z|null           |8690612928     |null           |             
2026-03-27T03:06:59.592Z|2026-03-26T03:53:30.000Z|null           |7813129095     |null           |                                   


Because this dumb guy @taylorbrooks was using AI to make up an answer. He has zero knowledge about Elasticsearch.

You can see all the field mappings for the .ml-anomalies index here. The response from this "bot" is just normal AI dreaming, imagining that such fields exist.

@dot-mike

With reference to the GitHub link you have given,
I can see that the fields I am using are correct:
"search_count" and "total_search_time_ms"
but my query in the previous reply is returning NULL.
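One possible explanation for the NULLs (an assumption worth testing, not a confirmed diagnosis): .ml-anomalies* holds several document types, and search_count / total_search_time_ms live on different documents than missing_field_count. The clause WHERE missing_field_count > 50 would then discard every document that carries the search stats before STATS runs. Moving the filter after STATS keeps both document types in each job's group:

POST /_query?format=txt
{
  "query": """
  FROM .ml-anomalies*
| STATS
    last_data = MAX(last_data_time),
    latest_record = MAX(latest_record_timestamp),
    search_count = MAX(search_count),
    search_time = MAX(total_search_time_ms),
    missing_count = MAX(missing_field_count)
  BY job_id
| WHERE missing_count > 50
| SORT missing_count DESC
| LIMIT 100
  """
}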