When indices only exist on remote cluster: datafeed [xxxx] cannot retrieve data because no index matches datafeed's indices

I am getting the following error when trying to start the datafeed of an ML Anomaly Detection Job:

mydatafeed failed to start

datafeed [mydatafeed] cannot retrieve data because no index matches datafeed's indices [foo.bar*, *:foo.bar*]

'See the full error' shows this:

{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "datafeed [mydatafeed] cannot retrieve data because no index matches datafeed's indices [foo.bar*, *:foo.bar*]"
      }
    ],
    "type": "status_exception",
    "reason": "datafeed [mydatafeed] cannot retrieve data because no index matches datafeed's indices [foo.bar*, *:foo.bar*]"
  },
  "status": 400
}

The setup is as follows, on v8.7.0:

  • one elastic cluster which we'll call the "primary" with Kibana and ML node(s)
  • other linked trusted elastic clusters used for Cross Cluster Search (CCS)

I have several machine learning jobs running fine, and I'm now trying to add a new one for a different set of indices. The crucial difference with this new job seems to be that there are currently no indices/data streams matching foo.bar* on the "primary" elastic cluster.

These are hopefully the relevant bits of the datafeed_config in the ML job:

    "indices_options": {
      "expand_wildcards": [
        "open"
      ],
      "ignore_unavailable": false,
      "allow_no_indices": true,
      "ignore_throttled": true
    },
    "query": {
      ...
    },
    "indices": [
      "foo.bar*"
      "*:foo.bar*"
    ],
    "scroll_size": 1000,
    "delayed_data_check_config": {
      "enabled": true
    }
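
(For anyone who wants to inspect their own config the same way: the full datafeed configuration can be fetched with the get datafeeds API, e.g. something like the following, using the datafeed ID from above.)

    GET _ml/datafeeds/mydatafeed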

I've tried removing foo.bar* from the list of indices (although I'd like to keep it there, because new data might appear on the primary in future), so that the config is just:

    "indices": [
      "*:foo.bar*"
    ],

but it fails in the same way.

My issue seems similar to [ML] Datafeed fails on missing indices, even with allow_no_indices set to true · Issue #62404 · elastic/elasticsearch · GitHub (and I do have allow_no_indices set to true), but the difference here is that remote clusters are involved, so there are indices available which the datafeed could start consuming.
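
(One way to confirm that from the primary cluster is the resolve index API; a rough sanity check along these lines, using the datafeed's remote-only pattern, should list the matching remote indices/data streams.)

    GET _resolve/index/*:foo.bar*?expand_wildcards=open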

The error I'm receiving seems to come from this code: elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/extractor/scroll/ScrollDataExtractorFactory.java at v8.7.0 · elastic/elasticsearch · GitHub

I found an interesting comment here: elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/DatafeedNodeSelector.java at v8.7.0 · elastic/elasticsearch · GitHub, which suggests there is an intention to succeed in the case of remote indices. However, in my case it seems to have got past this point, because a node has been allocated: if I look in Kibana at /app/ml/jobs, expand the job row, and look at 'Job messages', it says:

Opening job on node [instance-0000000090]

so it is failing at a later stage, when it actually tries to retrieve data for the datafeed.

Do you think this is a bug that I should report at GitHub - elastic/elasticsearch: Free and Open, Distributed, RESTful Search Engine?

I noticed the code that produces the error is looking at a FieldCapabilitiesResponse, so perhaps there is in fact an (implicit) requirement that the cluster where the ML job runs has at least one index matching the datafeed's indices (so that the Field capabilities API | Elasticsearch Guide [8.12] | Elastic can return relevant data)?
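
If that's the case, the underlying check presumably amounts to something like a field caps request over the datafeed's index patterns; a rough hand-rolled equivalent (the fields here are just a couple of examples from my job, not necessarily what the datafeed actually asks for) would be:

    GET /foo.bar*,*:foo.bar*/_field_caps?fields=@timestamp,message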

If so, is this (or should it be) documented somewhere?

Is there any way around this, given the setup described above? I don't want ML nodes on the remote clusters - and even if I did pick one remote cluster where foo.bar* indices are present, it would only see its own indices, not those from the other remotes.

In the meantime I've managed to work around this by posting a single fake document on the "primary" elastic cluster, so that the data stream exists, there is at least one matching index, and the datafeed will start.

POST /foo.bar.datastream/_doc/
{
  "@timestamp": "2024-03-20T12:00:00.000Z",
  "log.level": "INFO",
  "log.logger": "not.a.real.logger",
  "log.origin.function": "not.a.real.function",
  "message": "This is not a real log message, it is posted from Kibana Dev Tools purely in order to ensure that a foo.bar* index exists so that the Anomaly Detection job will start"
}
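
With that placeholder document indexed, the datafeed does start; for reference, the API equivalent of starting it from Kibana is along the lines of:

    POST _ml/datafeeds/mydatafeed/_start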

I included a minimal set of fields, namely all the ones referenced by my ML job (e.g. in the datafeed query / analysis_config / influencers), on the assumption that the field capabilities check would probably want to look at at least some of them.

This is of course not at all ideal - it requires polluting the production data with this random log line, just in order to get the datafeed started.
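
If the placeholder needs to be removed before ILM eventually ages it out, a delete by query against the data stream should do it (matching on the obviously fake logger value; I haven't checked whether an already-running datafeed is affected once the last local match disappears again):

    POST /foo.bar.datastream/_delete_by_query
    {
      "query": {
        "match": {
          "log.logger": "not.a.real.logger"
        }
      }
    }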

And if I want to create a new job in future (after this magic log line has been aged out and deleted by the ILM policies governing foo.bar*), then I'll have to do this again.
