Missing data in daily aggregate compared to source and 5min aggregate

Hello,
We noticed a mismatch in some daily aggregates compared to the 5min aggregates based on the same source data:

As you can see on the charts above, the 1day_agg chart shows one day with a much lower value (not 0 though, it's 2 here) than the corresponding day on the 5min_agg chart or in the source.
There might be days with smaller differences (not checked, as it would require some scripting), but the one above is one of the most obvious to spot.

The issue has occurred a few times since last month, but the example above, on the 1st of August, is the only one where we still have data for all three cases, due to retention policies.

To give more context, we are using transforms to aggregate some access metrics from the source data stream ("report-geoplateforme.raw") into two separate indices:

  • a 5min-aggregated index with the following transform setup:
{
  "id": "report-geoplateforme.aggregate-datastore_offering_5min",
  "authorization": {
    "roles": [
      "superuser"
    ]
  },
  "version": "10.0.0",
  "create_time": 1718701974152,
  "source": {
    "index": [
      "report-geoplateforme.raw"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "report-geoplateforme.aggregate-datastore_offering_5min"
  },
  "frequency": "150s",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m"
        }
      },
      "gpf_community": {
        "terms": {
          "field": "gpf_community.keyword"
        }
      },
      "gpf_datastore": {
        "terms": {
          "field": "gpf_datastore.keyword"
        }
      },
      "gpf_offering": {
        "terms": {
          "field": "gpf_offering.keyword"
        }
      }
    },
    "aggregations": {
      "data_transfer": {
        "sum": {
          "field": "data_transfer"
        }
      },
      "begin_date": {
        "min": {
          "field": "@timestamp"
        }
      },
      "end_date": {
        "max": {
          "field": "@timestamp"
        }
      },
      "hits": {
        "value_count": {
          "field": "@timestamp"
        }
      }
    }
  },
  "settings": {},
  "retention_policy": {
    "time": {
      "field": "end_date",
      "max_age": "31d"
    }
  }
}
  • a 1day-aggregated index with the following transform setup:
{
  "id": "report-geoplateforme.aggregate-datastore_offering_1day",
  "authorization": {
    "roles": [
      "superuser"
    ]
  },
  "version": "10.0.0",
  "create_time": 1718701977246,
  "source": {
    "index": [
      "report-geoplateforme.raw"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "report-geoplateforme.aggregate-datastore_offering_1day"
  },
  "frequency": "1h",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "1d"
        }
      },
      "gpf_community": {
        "terms": {
          "field": "gpf_community.keyword"
        }
      },
      "gpf_datastore": {
        "terms": {
          "field": "gpf_datastore.keyword"
        }
      },
      "gpf_offering": {
        "terms": {
          "field": "gpf_offering.keyword"
        }
      }
    },
    "aggregations": {
      "data_transfer": {
        "sum": {
          "field": "data_transfer"
        }
      },
      "begin_date": {
        "min": {
          "field": "@timestamp"
        }
      },
      "end_date": {
        "max": {
          "field": "@timestamp"
        }
      },
      "hits": {
        "value_count": {
          "field": "@timestamp"
        }
      }
    }
  },
  "settings": {}
}

Both resulting indices are expected to overlap; that's why we are not using rollover/downsampling to move from the first to the latter.

Both jobs seem to work smoothly (not a single failure in the stats, nor any message other than "Finished indexing for transform checkpoint [NN]." in .transform-notifications-*), and the pod dedicated to transform operations (running on Kubernetes, managed by ECK) stays well below its allocated resource limits in the history we have for the affected days.

From what I read in the thread "Transform missing data", we don't add any filter that could explain such a difference.

The only differences I see between the two transform jobs are the number of documents processed per checkpoint and the delay before processing them.

Could someone suggest other paths to investigate this issue, please? I haven't been using transforms for very long, so I might have misunderstood some mechanism that needs more tuning here...

Thanks in advance

There might be a problem with too long a delay between the value in @timestamp and the time the document is actually indexed and available to be queried.

The recommended way to approach this is to use a timestamp that stores the time of ingestion for sync.time.field, for example event.ingested for ECS-based indices.

More details can be found here: Using the ingest timestamp for syncing the transform.
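Roughly, and only as a sketch with placeholder names (the pipeline name and delay value below are examples, not something from your setup), it could look like this:

# hypothetical pipeline that stamps each document with its ingest time (ECS event.ingested)
PUT _ingest/pipeline/add-event-ingested
{
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "copy_from": "_ingest.timestamp"
      }
    }
  ]
}

# then, in the transform config, sync on the ingest timestamp instead of @timestamp;
# the pivot's date_histogram can keep grouping on @timestamp
"sync": {
  "time": {
    "field": "event.ingested",
    "delay": "120s"
  }
}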

Hope this helps!

Hello @walterra, thanks for your fast reply :wink:

I read about this setup, but as the 5min aggregate is built successfully in "real time" during the day, I thought it was not worth exploring this path to fix the daily aggregate...

I agree it would at least harden the process, so I will have to implement it anyway => for this first implementation I was only counting on all data being indexed on time, and would have increased the delay if that was not the case... It's definitely worth implementing the best practice you point to instead of waiting for an issue to occur :sweat_smile:

I hope it won't make this thread wander too much, but since you raise that interesting point, could you confirm how the bucket would get updated?
Will all values get merged accordingly between the already consolidated bucket and the values from the new event? That is:

  • hits[bucket] += 1
  • data_transfer[bucket] += data_transfer[event]
  • begin_date[bucket] = min(begin_date[bucket], begin_date[event])
  • end_date[bucket] = max(end_date[bucket], end_date[event]).

The 1day transform is less prone to this kind of issue, I hope, but I'm interested in the answer for the 5min one, and in case we have to inject some data later (e.g. following an issue on the sending side of the log-processing chain...).

Anyway, from what I can tell, in the case of the 1day_agg chart above, even if the job might have missed some values at the end of the day because the current setup is not hardened, that still can't explain how it missed that much (from ~109k down to only 2). The source data is spread fairly evenly across the whole day, from what we can see on this new chart:


So I could understand one of the last 5min buckets not being counted, but here it looks more like the whole day vanished before the 1day job started...

In case it could be linked to my issue, here is the ILM policy of the indices from the report-geoplateforme.raw data stream:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "7h",
            "max_primary_shard_size": "50gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "36h",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "14d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}

I thought it could be the culprit at first glance, but if I got things right, no action is taken on the indices being aggregated before they are processed:

  • the hot-phase's rollover after 7h or 50GB only closes the current index and starts writing into the next one
    => I can't imagine this could have any impact on transform jobs... right? :sweat_smile:
  • the warm-phase transition should happen way after the transform job has finished
    => the worst case should be that the oldest index is moved to warm nodes around 05:00 on day+1, while this transform job runs for ~35min around 00:16 (according to exponential_avg_checkpoint_duration_ms and the last timestamp_millis), and I don't see any CPU/RAM/disk-IO usage that could explain a ~5h duration... Nor should the move to warm nodes cause such an issue, unless I missed some warnings buried in the docs :sweat_smile: (the ILM explain call sketched just below is one way to double-check those phase timings)
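Sketch of that explain call (the wildcard pattern just assumes the default .ds-<data stream>-* backing-index naming):

# shows the current phase/action/step and when each backing index entered them
GET .ds-report-geoplateforme.raw-*/_ilm/explain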

It's a pity the transform stats API doesn't include the min/max checkpoint duration of continuous jobs since they were created/started, as the second point could then have been definitively ruled out thanks to such a stat...

Anyway, I'm also pasting here the stats of this 1day job and its 5min counterpart, in case something is obvious to expert eyes:

GET _transform/report-geoplateforme.aggregate-datastore_offering_*/_stats
{
  "count": 2,
  "transforms": [
    {
      "id": "report-geoplateforme.aggregate-datastore_offering_1day",
      "state": "started",
      "node": {
        "id": "cQLBAs4ZTt2ec_193MBt8g",
        "name": "elasticsearch-es-transform-0",
        "ephemeral_id": "OsG4t35qSqa7DENOCAO6VQ",
        "transport_address": "10.2.9.116:9300",
        "attributes": {}
      },
      "stats": {
        "pages_processed": 76105,
        "documents_processed": 14333094294,
        "documents_indexed": 36106809,
        "documents_deleted": 0,
        "trigger_count": 1166,
        "index_time_in_ms": 3160417,
        "index_total": 72989,
        "index_failures": 0,
        "search_time_in_ms": 118843931,
        "search_total": 76105,
        "search_failures": 0,
        "processing_time_in_ms": 357814,
        "processing_total": 76105,
        "delete_time_in_ms": 0,
        "exponential_avg_checkpoint_duration_ms": 2140883.3279842567,
        "exponential_avg_documents_indexed": 378383.89760958945,
        "exponential_avg_documents_processed": 273968338.6042385
      },
      "checkpointing": {
        "last": {
          "checkpoint": 50,
          "timestamp_millis": 1722903364880,
          "time_upper_bound_millis": 1722902400000
        },
        "changes_last_detected_at": 1722903364592,
        "last_search_time": 1722957366615
      },
      "health": {
        "status": "green"
      }
    },
    {
      "id": "report-geoplateforme.aggregate-datastore_offering_5min",
      "state": "started",
      "node": {
        "id": "cQLBAs4ZTt2ec_193MBt8g",
        "name": "elasticsearch-es-transform-0",
        "ephemeral_id": "OsG4t35qSqa7DENOCAO6VQ",
        "transport_address": "10.2.9.116:9300",
        "attributes": {}
      },
      "stats": {
        "pages_processed": 460282,
        "documents_processed": 14172099216,
        "documents_indexed": 192767003,
        "documents_deleted": 2605556,
        "trigger_count": 27871,
        "index_time_in_ms": 17301517,
        "index_total": 396734,
        "index_failures": 0,
        "search_time_in_ms": 100974571,
        "search_total": 460282,
        "search_failures": 0,
        "processing_time_in_ms": 2177260,
        "processing_total": 460282,
        "delete_time_in_ms": 588537,
        "exponential_avg_checkpoint_duration_ms": 7977.7250157222825,
        "exponential_avg_documents_indexed": 1908.3633738835777,
        "exponential_avg_documents_processed": 2181767.2220575055
      },
      "checkpointing": {
        "last": {
          "checkpoint": 13960,
          "timestamp_millis": 1722958050842,
          "time_upper_bound_millis": 1722957900000
        },
        "changes_last_detected_at": 1722958050833,
        "last_search_time": 1722958050833
      },
      "health": {
        "status": "green"
      }
    }
  ]
}

All other 5min jobs run below ~5s and all other 1day jobs below ~5min, according to their exponential_avg_checkpoint_duration_ms.
There are 10 jobs of each type, including the ones shown here.

"timestamp_millis": 1722903364880,
"time_upper_bound_millis": 1722902400000
"last_search_time": 1722957366615

converting to human time

"timestamp_millis": 00:16:04 UTC
"time_upper_bound_millis": 00:00:00 UTC
"last_search_time": 15:16:06 UTC

I could be wrong, but I would think the timestamps should be closer, at least within the frequency range. last_search_time is close to when the checkpoint starts. timestamp_millis is the last "start time" for the search query range, and it gets updated when the checkpoint finishes. My hunch is that the checkpoint is taking a lot longer than the average leads us to believe.

Now that another day has gone by, can you check the stats API again to see what the output says? We should have progressed 1 checkpoint per hour since then, since the frequency is 1 hour, so anything less than that can give us an idea if it isn't keeping up.

Are there any errors in the logs, either for the Transform or for the search/index nodes where the requests may be running against the raw index?

that still can't explain how it missed that much (from ~109k down to only 2).

You're right, I would expect it would miss a few hours of data at worst.

Will all values get merged accordingly between the already consolidated bucket and the values from the new event?

My understanding is that the values get updated via a rewrite, kinda like a "purge and replace." For example, the 1 day search will search over the current day's data every 1hr and write the results to the destination index. The next hour it runs, it will redo the search, and any new data will be aggregated with the old data since it is the same day. So it's effectively the same as data_transfer[bucket] += data_transfer[event].

What might be happening here is that the transform runs at 11:01pm and collects the day of data, then runs at 12:01am and collects the new day of data, effectively skipping all the new data entered between 11:01pm and 11:59:59... But I would expect a lot more data in the bucket than "2".

align_checkpoints is the Transform setting that aligns the checkpoint's time range to the fixed_interval specified in the group_by. That is why we see "time_upper_bound_millis": 1722902400000. Updating this setting to false will disable this feature and instead set the upper_bound to now - delay, which will avoid missing that data but might also "purge" data from earlier in the day, if my understanding is correct.
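If you want to experiment with that, it should just be a settings update through the update transform API; a minimal sketch using your 1day transform id (the new value would take effect on a subsequent checkpoint):

# disable checkpoint alignment for the 1day transform
POST _transform/report-geoplateforme.aggregate-datastore_offering_1day/_update
{
  "settings": {
    "align_checkpoints": false
  }
}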

Hello @Patrick_Whelan,

Thanks for your help on that subject :wink:

I could be wrong, but I would think the timestamps should be closer, at least within the frequency range.

I'm not sure I get your point, but isn't it due to optimizations from align_checkpoints? (it defaults to true according to the Create transform API)
If I understood correctly, it avoids intermediate runs (and thus updates) between searches that can build a full date_histogram.fixed_interval bucket. At least, that would explain why the checkpoint number only increased by 2 in the latest stats, as only 2 full-bucket runs took place...
My bet is that last_search_time will increase at the pace of frequency, while timestamp_millis gets aligned to the start of the day because of date_histogram.fixed_interval = 1d and align_checkpoints = true.
That's not proof, but all my 1day jobs have a timestamp_millis between 00:00:00 UTC and 01:00:00 UTC (even before 00:20:00 UTC...) and checkpoint = 52 (all jobs were created the same day).

Now that another day has gone by, can you check the stats API again to see what the output says?

{
  "count": 2,
  "transforms": [
    {
      "id": "report-geoplateforme.aggregate-datastore_offering_1day",
      "state": "started",
      "node": {
        "id": "cQLBAs4ZTt2ec_193MBt8g",
        "name": "elasticsearch-es-transform-0",
        "ephemeral_id": "OsG4t35qSqa7DENOCAO6VQ",
        "transport_address": "10.2.9.116:9300",
        "attributes": {}
      },
      "stats": {
        "pages_processed": 76343,
        "documents_processed": 14976442333,
        "documents_indexed": 36196047,
        "documents_deleted": 0,
        "trigger_count": 1213,
        "index_time_in_ms": 3168780,
        "index_total": 73179,
        "index_failures": 0,
        "search_time_in_ms": 120987217,
        "search_total": 76343,
        "search_failures": 0,
        "processing_time_in_ms": 358867,
        "processing_total": 76343,
        "delete_time_in_ms": 0,
        "exponential_avg_checkpoint_duration_ms": 1774628.6741051634,
        "exponential_avg_documents_indexed": 266899.46864774165,
        "exponential_avg_documents_processed": 289437424.9664737
      },
      "checkpointing": {
        "last": {
          "checkpoint": 52,
          "timestamp_millis": 1723076170732,
          "time_upper_bound_millis": 1723075200000
        },
        "changes_last_detected_at": 1723076170724,
        "last_search_time": 1723126572722
      },
      "health": {
        "status": "green"
      }
    },
    {
      "id": "report-geoplateforme.aggregate-datastore_offering_5min",
      "state": "started",
      "node": {
        "id": "cQLBAs4ZTt2ec_193MBt8g",
        "name": "elasticsearch-es-transform-0",
        "ephemeral_id": "OsG4t35qSqa7DENOCAO6VQ",
        "transport_address": "10.2.9.116:9300",
        "attributes": {}
      },
      "stats": {
        "pages_processed": 464460,
        "documents_processed": 14780000275,
        "documents_indexed": 193421268,
        "documents_deleted": 2853220,
        "trigger_count": 29013,
        "index_time_in_ms": 17375903,
        "index_total": 398515,
        "index_failures": 0,
        "search_time_in_ms": 102846428,
        "search_total": 464460,
        "search_failures": 0,
        "processing_time_in_ms": 2186052,
        "processing_total": 464460,
        "delete_time_in_ms": 657404,
        "exponential_avg_checkpoint_duration_ms": 8836.59149519649,
        "exponential_avg_documents_indexed": 1962.3107466320855,
        "exponential_avg_documents_processed": 2114815.2370356224
      },
      "checkpointing": {
        "last": {
          "checkpoint": 14531,
          "timestamp_millis": 1723129404686,
          "time_upper_bound_millis": 1723129200000
        },
        "changes_last_detected_at": 1723129404679,
        "last_search_time": 1723129554726
      },
      "health": {
        "status": "green"
      }
    }
  ]
}

Focusing on the 1day job, we have:

"timestamp_millis": 1723076170732,
"time_upper_bound_millis": 1723075200000
"last_search_time": 1723126572722

Converted to human-readable:

"timestamp_millis": 2024-08-08T00:16:10 UTC
"time_upper_bound_millis": 2024-08-08T00:00:00 UTC
"last_search_time": 2024-08-08T14:16:12 UTC

We should have progressed 1 checkpoint per hour since then, since the frequency is 1 hour, so anything less than that can give us an idea if it isn't keeping up.

We are looking at checkpoint 52, and the timestamp_millis shifted by only a few seconds compared to checkpoint 50 posted 2 days ago. I guess it's also linked to the point above about align_checkpoints (non-full-bucket runs were skipped, so the checkpoint number did not increase that much...)

Are there any errors in the logs, either for the Transform or for the search/index nodes where the requests may be running against the raw index?

I see no errors in the Transform messages, only the job creation and the successful checkpoints (one by one at first, and now only every 10 checkpoints since it reached 10).

That said, you pointed out an interesting path with the node logs: even if it's only a WARN, I see a lot of those in the transform-dedicated pod (most occurrences concerning the index with partial data, but also 1 or 2 occurrences for each of the other 1day jobs):

{
  "@timestamp": "2024-08-08T00:24:53.806Z",
  "log.level": "WARN",
  "message": "[report-geoplateforme.aggregate-datastore_offering_1day] Search context missing, falling back to normal search; request [apply_results]",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[elasticsearch-es-transform-0][transport_worker][T#3]",
  "log.logger": "org.elasticsearch.xpack.transform.transforms.ClientTransformIndexer",
  "elasticsearch.cluster.uuid": "KFRBIrwYSUCVZSbgwhYVWg",
  "elasticsearch.node.id": "cQLBAs4ZTt2ec_193MBt8g",
  "elasticsearch.node.name": "elasticsearch-es-transform-0",
  "elasticsearch.cluster.name": "elasticsearch",
  "error.type": "org.elasticsearch.action.search.SearchPhaseExecutionException",
  "error.message": "Partial shards failure",
  "error.stack_trace": "<FULL_STACK_BELLOW>"
}
Full stack here
Failed to execute phase [query], Partial shards failure; shardFailures {[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.07.25-000206][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835916]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[7ixH2vqrRYCkbqOebGB94A][.ds-report-geoplateforme.raw-2024.07.25-000210][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-0][10.2.183.164:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4397587]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[7ixH2vqrRYCkbqOebGB94A][.ds-report-geoplateforme.raw-2024.07.26-000214][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-0][10.2.183.164:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4397586]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[7ixH2vqrRYCkbqOebGB94A][.ds-report-geoplateforme.raw-2024.07.26-000216][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-0][10.2.183.164:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4397588]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.07.26-000218][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835917]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[7ixH2vqrRYCkbqOebGB94A][.ds-report-geoplateforme.raw-2024.07.27-000220][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-0][10.2.183.164:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4397589]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.07.27-000222][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835914]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[7ixH2vqrRYCkbqOebGB94A][.ds-report-geoplateforme.raw-2024.07.27-000224][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-0][10.2.183.164:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4397590]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[7ixH2vqrRYCkbqOebGB94A][.ds-report-geoplateforme.raw-2024.07.28-000226][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-0][10.2.183.164:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4397593]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.07.28-000230][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835915]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.07.29-000236][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835918]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.07.30-000242][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835920]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[S6SI9BUhTDmtO00OWkfi7g][.ds-report-geoplateforme.raw-2024.08.02-000262][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-warm-cold-content-2][10.2.178.20:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835919]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[fH0nAB18SuO-30NtHK3cRA][.ds-report-geoplateforme.raw-2024.08.06-000290][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-hot-2][10.2.178.16:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1957919]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}{[fH0nAB18SuO-30NtHK3cRA][.ds-report-geoplateforme.raw-2024.08.07-000301][0]: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-data-hot-2][10.2.178.16:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1957921]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)
}
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:712)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:418)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:744)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:497)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:335)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:53)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:634)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.TransportService$UnregisterChildTransportResponseHandler.handleException(TransportService.java:1751)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1475)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundHandler.doHandleException(InboundHandler.java:475)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:462)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:453)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundHandler.executeResponseHandler(InboundHandler.java:145)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:122)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:96)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:821)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:124)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:96)
	at org.elasticsearch.server@8.13.4/org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:61)
	at org.elasticsearch.transport.netty4@8.13.4/org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:48)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.codec@4.1.94.Final/io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1383)
	at io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1246)
	at io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1295)
	at io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	at io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.transport@4.1.94.Final/io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)
	at io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)
	at io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at io.netty.common@4.1.94.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.common@4.1.94.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [4835916]
	at org.elasticsearch.search.SearchService.findReaderContext(SearchService.java:910)
	at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:924)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:671)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:549)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:73)
	at org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:70)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)

From what I read in Continuous transform failures, it looks like this is not supposed to be an issue by itself, and I don't see any "irrecoverable failure", but I don't have enough log history to confirm there was nothing worse than those WARN logs during the affected days... I will try to fix the log collection process on my cluster and come back with more details if I manage to record a new occurrence :sweat_smile:

As a fallback, is there any way to catch even the WARN events in the transform messages? I don't see why an INFO log, such as "Finished indexing for transform checkpoint [NN].", is found there while none of the WARN logs are...
Put another way => is there a way to make WARN messages end up in the .transform-notifications index as well?

My understanding is that the values get updated via a rewrite, kinda like a "purge and replace." For example, the 1 day search will search over the current day's data every 1hr and write the results to the destination index. The next hour it runs, it will redo the search, and any new data will be aggregated with the old data since it is the same day. So it's effectively the same as data_transfer[bucket] += data_transfer[event].

I fear that's not good news for my post-incident recovery scenario => I only have 2 weeks of raw data history, so adding some missing events after this delay could result in a bucket containing only those events, which could be more harmful than valuable :frowning_face:
I will maybe come back to this subject with some live tests later and create a separate thread if I face any unexpected/hazardous behaviors.
Sorry for mixing topics in this already headache-prone thread; let's focus on finding why some buckets are nearly empty, as I think we're on the right path with the WARN logs above :wink:

If I understood correctly, it avoids intermediate runs (and thus updates) between searches that can build a full date_histogram.fixed_interval bucket.

You are correct. I set up a test and attached a breakpoint: it will run the checkpoint, but align_checkpoints forces the gte and lt of the query range to be the same, so the checkpoint just checks a time range of 0. When there are no results, it drops the checkpoint (but sets last_search_time). This continues every frequency until the date_histogram.fixed_interval rolls over, at which point it searches for gte: <last.time_upper_bound_millis> and lt: <next.time_upper_bound_millis>, which should both be 00:00:00 but 24h apart.
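So once the interval rolls over, the effective change-detection range looks roughly like this (just a sketch of the range clause, not the full query the Transform builds; the values are the epoch millis from your earlier stats output, i.e. the previous time_upper_bound_millis and the boundary 24h later):

{
  "range": {
    "@timestamp": {
      "gte": 1722902400000,
      "lt": 1722988800000,
      "format": "epoch_millis"
    }
  }
}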

"Search context missing, falling back to normal search; request"

Yeah, this is benign. Transform tries to search over indices using Point in Time; that WARN just means there were shard failures on the indices for that snapshot. In this case, Transform simply drops the Point in Time and immediately retries the search request without it.

As a fallback, is there any way to catch even the WARN events in the transform messages? I don't see why an INFO log, such as "Finished indexing for transform checkpoint [NN].", is found there while none of the WARN logs are...
Put another way => is there a way to make WARN messages end up in the .transform-notifications index as well?

.transform-notifications is a custom audit trail and is separate from the log4j logger. Failures do sometimes make it there, but only if they caused the Transform to abort the checkpoint (and potentially retry). All other logs are written to the log4j logger and should appear in the node logs depending on the configured log level.
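If you want the node logs to show more than the occasional WARN, one option is to temporarily raise the transform logger level via the cluster settings API (a sketch; it can get noisy, so best reverted afterwards):

# temporarily raise the transform logging verbosity on all nodes
PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.xpack.transform": "debug"
  }
}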


More logs would be good. From everything we can see now, it looks like the search request resulted in a bucket size of 2, so the Transform thought it succeeded and moved the "last upper bound" forward to the current day.

Hello @Patrick_Whelan,

I caught the logs of a new failure (aggregate of 2024-08-28, from the transform job running on 2024-08-29), but most of them consist of the "Search context missing, falling back to normal search" warning from my last comment.

I can confirm those logs occur every night between 00:00 and around 02:30 and seem to match the 1day aggregate runs, even on days without missing hits (or at least none that have been spotted ATM).

I also spotted a few logs during the day (~12:00), so some of the 5min aggregates must be affected as well, which can be confirmed by the chart below (there is a small gap of ~1k hits between the raw and 5min values on 2024-08-28):

Back to the logs of the failed 1day aggregate: I only see unrelated INFO logs in the middle of the recurring WARNs... or at least they seem unrelated, so I'm keeping them in the hidden block below just in case :sweat_smile:

Excluded logs
[plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Started. Checking for changes to endpoint artifacts
[plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Last computed manifest not available yet
[plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Complete. Task run took 5ms [ stated: 2024-08-28T02:28:36.188Z ]
...
[plugins.fleet] Running Fleet Usage telemetry send task
[plugins.fleet] Fleet Usage: {"agents_enabled":true,"agents":{"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"inactive":0,"unenrolled":0,"total_all_statuses":0,"updating":0},"fleet_server":{"total_all_statuses":0,"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"updating":0,"num_host_urls":0}}
...
[org.elasticsearch.xpack.ml.action.TransportDeleteExpiredDataAction] Deleting expired data
[org.elasticsearch.xpack.ml.action.TransportDeleteExpiredDataAction] Completed deletion of expired ML data
...
[org.elasticsearch.xpack.ml.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[org.elasticsearch.xpack.ml.MlDailyMaintenanceService] Successfully completed [ML] maintenance task: triggerDeleteExpiredDataTask
...
[org.elasticsearch.xpack.ml.job.retention.UnusedStatsRemover] Successfully deleted [0] unused stats documents
...
[org.elasticsearch.xpack.slm.SnapshotRetentionTask] starting SLM retention snapshot cleanup task
[org.elasticsearch.xpack.slm.SnapshotRetentionTask] there are no repositories to fetch, SLM retention snapshot cleanup task complete

The most closely related logs could be the phase transitions occurring every 7h, but they are always against an index containing events older than the running aggregate (here against an index from 2024-08-26 during the aggregation of 2024-08-28):

[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"hot","action":"complete","name":"complete"}] to [{"phase":"warm","action":"set_priority","name":"set_priority"}] in policy [report-geoplateforme.raw]
[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"warm","action":"set_priority","name":"set_priority"}] to [{"phase":"warm","action":"allocate","name":"allocate"}] in policy [report-geoplateforme.raw]
[org.elasticsearch.cluster.metadata.MetadataUpdateSettingsService] updating number_of_replicas to [0] for indices [.ds-report-geoplateforme.raw-2024.08.26-000425]
[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"warm","action":"allocate","name":"allocate"}] to [{"phase":"warm","action":"allocate","name":"check-allocation"}] in policy [report-geoplateforme.raw]
[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"warm","action":"allocate","name":"check-allocation"}] to [{"phase":"warm","action":"migrate","name":"branch-check-skip-action"}] in policy [report-geoplateforme.raw]
[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"warm","action":"migrate","name":"branch-check-skip-action"}] to [{"phase":"warm","action":"migrate","name":"migrate"}] in policy [report-geoplateforme.raw]
[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"warm","action":"migrate","name":"migrate"}] to [{"phase":"warm","action":"migrate","name":"check-migration"}] in policy [report-geoplateforme.raw]
[org.elasticsearch.xpack.ilm.IndexLifecycleTransition] moving index [.ds-report-geoplateforme.raw-2024.08.26-000425] from [{"phase":"warm","action":"migrate","name":"check-migration"}] to [{"phase":"warm","action":"complete","name":"complete"}] in policy [report-geoplateforme.raw]

Those transition events also occur during the day, around the time of the WARN logs related to the 5min jobs, but as seen on the chart above, the 5min job results are perfectly aligned with the raw data (except for the day where the 1day job missed most hits), so I'm a bit lost on what to think about this suspicious timing... :upside_down_face:

Unless those last details ring some bells for someone with a view of the source code, I think I will take a wild guess and review the ILM policy and the 1day job schedule so that the aggregation occurs when the indices are already on the warm nodes, which are far less busy...
I hope this could at least help get rid of those "Search context missing, falling back to normal search" warnings, even though they don't seem to be the root cause here... :crossed_fingers:
It could also be a good opportunity to implement @walterra's advice about ingest timestamps, as I have not had the time to do so since he spotted this in my first post :sweat_smile: