ML - Datafeed is encountering errors extracting data: all shards failed

Hi!

I have created an ML job, but when I select Start Datafeed, I get this error: Datafeed is encountering errors extracting data: all shards failed, as in the image below.

I really don't have any idea why I get this. I used the same configuration some months ago and it worked with no problems, but today it doesn't work at all.

If you can give me any advice for solving this problem, I would appreciate it a lot.

What is the datafeed configuration?

(it would be the output of the following)

GET _ml/datafeeds/datafeed-dns_exfiltration

Is the index pattern referenced in the datafeed configuration queryable?

GET yourindexname/_search

(where yourindexname is whatever is found in the indices section of the response from the first command)

Hello @richcollier! It is a pleasure to meet you again! :slight_smile: You helped me in some previous posts, and I thank you a lot. You are a guru in ELK ML :slight_smile:

To respond to your questions, for the command GET _ml/datafeeds/datafeed-dns_exfiltration, I get:

{
  "count" : 1,
  "datafeeds" : [
    {
      "datafeed_id" : "datafeed-dns_exfiltration",
      "job_id" : "dns_exfiltration",
      "query_delay" : "65004ms",
      "indices" : [
        "packetbeat-*"
      ],
      "query" : {
        "bool" : {
          "should" : [
            {
              "match_phrase" : {
                "type" : "dns"
              }
            }
          ],
          "minimum_should_match" : 1,
          "filter" : [
            {
              "match_phrase" : {
                "type" : "dns"
              }
            }
          ],
          "must_not" : [ ]
        }
      },
      "script_fields" : {
        "hrd" : {
          "script" : {
            "source" : "return domainSplit(doc['dns.question.name'].value).get(1);",
            "lang" : "painless"
          },
          "ignore_failure" : false
        },
        "sub" : {
          "script" : {
            "source" : "return domainSplit(doc['dns.question.name'].value).get(0);",
            "lang" : "painless"
          },
          "ignore_failure" : false
        }
      },
      "scroll_size" : 1000,
      "chunking_config" : {
        "mode" : "auto"
      },
      "delayed_data_check_config" : {
        "enabled" : true
      }
    }
  ]
}

For the command GET packetbeat-*/_search, I obtain:

{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
  },
  "hits" : {
"total" : {
  "value" : 1128,
  "relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
  {
    "_index" : "packetbeat-7.6.2",
    "_type" : "_doc",
    "_id" : "mHtwxXcBN1tfp17DWxnv",
    "_score" : 1.0,
    "_source" : {
      "@timestamp" : "2021-02-21T16:33:30.001Z",
      "network" : {
        "bytes" : 1099865,
        "packets" : 4023,
        "type" : "ipv4",
        "transport" : "tcp",
        "community_id" : "1:YqII7mbgOpdT5mf3MMJdPvr+dhk="
      },
      "host" : {
        "name" : "ubuntu",
        "hostname" : "ubuntu",
        "architecture" : "x86_64",
        "os" : {
          "codename" : "bionic",
          "platform" : "ubuntu",
          "version" : "18.04.1 LTS (Bionic Beaver)",
          "family" : "debian",
          "name" : "Ubuntu",
          "kernel" : "4.15.0-29-generic"
        },
        "id" : "8f68089f99fc4e6db58b1d98c7ee3d64",
        "containerized" : false
      },
      "ecs" : {
        "version" : "1.4.0"
      },
      "type" : "flow",
      "source" : {
        "bytes" : 786344,
        "ip" : "127.0.0.1",
        "port" : 58526,
        "packets" : 2408
      },
      "destination" : {
        "packets" : 1615,
        "ip" : "127.0.0.1",
        "port" : 9200,
        "bytes" : 313521
      },
      "event" : {
        "end" : "2021-02-21T16:33:29.752Z",
        "duration" : 2627491892158,
        "dataset" : "flow",
        "kind" : "event",
        "category" : "network_traffic",
        "action" : "network_flow",
        "start" : "2021-02-21T15:49:42.261Z"
      },
      "agent" : {
        "type" : "packetbeat",
        "ephemeral_id" : "00bfd7ae-a87e-4c49-9a33-0e46bea2b072",
        "hostname" : "ubuntu",
        "id" : "b306d0ba-7d77-4c5b-a009-a2bcb70c4922",
        "version" : "7.6.2"
      },
      "flow" : {
        "id" : "EAT/////AP//////CP8AAAF/AAABfwAAAZ7k8CM",
        "final" : false
      }
    }
  },
  ....... and so on
  }
}

Thanks for that info. At least we know the index exists and that it is queryable. Let's now validate that the type of query this ML job is configured for will also execute. Can you try the following query and see if it returns without problems/errors?

GET packetbeat-*/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "type": "dns"
          }
        }
      ],
      "minimum_should_match": 1,
      "filter": [
        {
          "match_phrase": {
            "type": "dns"
          }
        }
      ],
      "must_not": []
    }
  },
  "script_fields": {
    "hrd": {
      "script": {
        "source": "return domainSplit(doc['dns.question.name'].value).get(1);",
        "lang": "painless"
      },
      "ignore_failure": false
    },
    "sub": {
      "script": {
        "source": "return domainSplit(doc['dns.question.name'].value).get(0);",
        "lang": "painless"
      },
      "ignore_failure": false
    }
  }
}

If that does work, by the way, then something else must be logistically wrong and we would need to look more closely at the elasticsearch.log file for detailed errors. To do that, probably the easiest way would be to clone the existing job and force it to run over some historical data immediately. That way, you can look in the elasticsearch.log file for the errors that occur when you force it to run.
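As a rough sketch of what "force it to run over historical data" looks like via the API (the clone's job and datafeed names and the dates below are hypothetical, assuming you cloned the job as dns_exfiltration_clone):

POST _ml/anomaly_detectors/dns_exfiltration_clone/_open

POST _ml/datafeeds/datafeed-dns_exfiltration_clone/_start
{
  "start": "2021-02-01T00:00:00Z",
  "end": "2021-02-22T00:00:00Z"
}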

When I run this, I get an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:762)",
          "org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:116)",
          "org.elasticsearch.index.query.QueryShardContext.lambda$lookup$0(QueryShardContext.java:311)",
          "org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:101)",
          "org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:98)",
          "java.base/java.security.AccessController.doPrivileged(AccessController.java:312)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:98)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:41)",
          "return domainSplit(doc['dns.question.name'].value).get(1);",
          "                       ^---- HERE"
        ],
        "script" : "return domainSplit(doc['dns.question.name'].value).get(1);",
        "lang" : "painless"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "packetbeat-7.6.2",
        "node" : "bdCh8LxiRIu9AuIEshOKdg",
        "reason" : {
          "type" : "script_exception",
          "reason" : "runtime error",
          "script_stack" : [
            "org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:762)",
            "org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:116)",
            "org.elasticsearch.index.query.QueryShardContext.lambda$lookup$0(QueryShardContext.java:311)",
            "org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:101)",
            "org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:98)",
            "java.base/java.security.AccessController.doPrivileged(AccessController.java:312)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:98)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:41)",
            "return domainSplit(doc['dns.question.name'].value).get(1);",
            "                       ^---- HERE"
          ],
          "script" : "return domainSplit(doc['dns.question.name'].value).get(1);",
          "lang" : "painless",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [dns.question.name] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
          }
        }
      }
    ]
  },
  "status" : 400
}

Do you know how I could solve this?

The field dns.question.name needs to be of type keyword for it to comply with ECS (see DNS Fields | Elastic Common Schema (ECS) Reference [1.8] | Elastic). You can confirm what type it is set to in your index via:

GET packetbeat-*/_mapping/field/dns.question.name

If you ingested the data via packetbeat, it should have done this for you - I'm not sure why it wouldn't have. How are you ingesting this data?
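If that mapping were to come back as text rather than keyword (just a hypothetical at this point), one workaround would be to point the datafeed's script fields at a keyword sub-field, if one exists, for example:

"script_fields": {
  "hrd": {
    "script": {
      "source": "return domainSplit(doc['dns.question.name.keyword'].value).get(1);",
      "lang": "painless"
    }
  },
  "sub": {
    "script": {
      "source": "return domainSplit(doc['dns.question.name.keyword'].value).get(0);",
      "lang": "painless"
    }
  }
}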

Hi @richcollier!

After running the above GET command, I obtained this:

{
  "packetbeat-7.6.2-2021.02.22-000001" : {
    "mappings" : {
      "dns.question.name" : {
        "full_name" : "dns.question.name",
        "mapping" : {
          "name" : {
            "type" : "keyword",
            "ignore_above" : 1024
          }
        }
      }
    }
  }
}

Yes, all the data is ingested via packetbeat.

However, before getting the all shards failed error for the DNS Data Exfiltration job, I got this one:

"{\"error\":{\"root_cause\":[{\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\"}],\"type\":\"status_exception\",\"reason\":\"Could not open job because no ML nodes with sufficient capacity were found\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [dns_exfiltration] on node [{ubuntu}{ml.machine_memory=4112064512}{ml.max_open_jobs=20}], because this node has insufficient available memory. Available memory for ML [1233619353], memory required by existing jobs [1115684864], estimated memory required for this job [1084227584]]\"}},\"status\":429}"

I also have no idea how I could solve this...

After that, I tried to create a new job for HTTP Data Exfiltration, using the GitHub Security Analytics Recipes, but when I pressed Start Datafeed, I got some errors regarding mappings. In the example below, the error is for http.request.headers.host, but I also got one for bytes_in:

"{\"error\":{\"root_cause\":[{\"type\":\"status_exception\",\"reason\":\"[datafeed-http_data_exfiltration] cannot retrieve field [http.request.headers.host] because it has no mappings\"}],\"type\":\"status_exception\",\"reason\":\"[datafeed-http_data_exfiltration] cannot retrieve field [http.request.headers.host] because it has no mappings\"},\"status\":400}"

It is like I am in a chain of errors :pensive:

Your query result implies that the packetbeat-* index pattern is matching an index called packetbeat-7.6.2-2021.02.22-000001, which has the right mappings, but I wonder if there are other, older indices that match the pattern but don't have the right mappings. Can you verify how many indices match the index pattern?

GET _cat/indices/packetbeat-*

I solved this problem. It was because I forgot to set send_all_headers: true for packetbeat.protocols.http in the packetbeat.yml file.
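For anyone hitting the same thing, the relevant part of packetbeat.yml looks roughly like this (a minimal sketch; the port list is illustrative):

packetbeat.protocols:
- type: http
  # ports to sniff for HTTP traffic (adjust to your environment)
  ports: [80, 8080, 8000, 5000, 8002]
  # index all request/response headers so fields like http.request.headers.host exist
  send_all_headers: true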

The insufficient capacity message is telling you that your 4GB system is too small for the job you're asking ML to execute (which might take around 1GB on its own). The amount of memory ML is allowed to use on the machine is governed by the settings shown in: Machine learning settings in Elasticsearch | Elasticsearch Reference [7.11] | Elastic

Specifically, xpack.ml.max_machine_memory_percent, which defaults to 30%. Therefore, a 4GB node will allow about 1.2GB to be allocated to ML. Since you already have about 1.1GB allocated to existing jobs (according to the error message):

node [{ubuntu}{ml.machine_memory=4112064512}{ml.max_open_jobs=20}], because this node has insufficient available memory. Available memory for ML [1233619353], memory required by existing jobs [1115684864], estimated memory required for this job [1084227584]]\"}}

...then you cannot open up any more jobs that require a lot of memory to run. Solutions:

  1. Get a bigger node
  2. Delete other jobs
  3. Slightly increase xpack.ml.max_machine_memory_percent, but be careful. ML operates outside the JVM heap. You need RAM for the JVM, ML, core Linux, and anything else running on that node.
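For reference, the numbers in that error message line up with the 30% default (a rough back-of-the-envelope check using the byte values quoted above):

4,112,064,512 bytes (machine memory) × 0.30 ≈ 1,233,619,353 bytes available to ML
1,233,619,353 − 1,115,684,864 (existing jobs) ≈   117,934,489 bytes left over
estimated requirement for the new job        ≈ 1,084,227,584 bytes → does not fit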

Hi @richcollier! Thank you for your response. I finally uninstalled Elasticsearch, installed it again, and things seem to work fine. However, I tried to create another job for suspicious process activity, as described in the GitHub Security Analytics Recipes:

PUT _ml/anomaly_detectors/suspicious_process_activity
{
  "description": "Suspicious Process Activity",
  "analysis_config": {
    "bucket_span": "5m",
    "influencers": [
      "auditd.log.a0",
      "host.name"
    ],
    "detectors": [
      {
        "function": "rare",
        "by_field_name": "auditd.log.a0",
        "partition_field_name": "host.name"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_plot_config": {
      "enabled" : true
  }
}

and

PUT _ml/datafeeds/datafeed-suspicious_process_activity
{
  "job_id": "suspicious_process_activity",
  "indices": [
    "filebeat-*"
  ],
  "query": {
    "term": {
      "event.action": {
        "value": "EXECVE"
      }
    }
  },
  "query_delay": "60s",
  "frequency": "300s",
  "scroll_size": 1000
}

I queried Kibana with "event.action": "EXECVE" and got results, as in the image below:

However, I got the following error, Datafeed lookback retrieved no data, and I don't understand why :frowning:

Can you help me, please?

Certainly nothing obvious from the job or datafeed config. How did you start the datafeed? Did you have it look back over any historical data?

In the last image I posted, I clicked the 3 dots next to the suspicious_process_activity job and selected Start Datafeed. In the window that appeared, I selected from the beginning of data until now (real-time search).

Ok - that's good to know. So now you need to systematically figure out why you're not getting data. Again, like we did earlier, you need to simulate the query that the datafeed is doing against the raw data, as in:

GET filebeat-*/_search
{
  "query": {
    "term": {
      "event.action": {
        "value": "EXECVE"
      }
    }
  }
}

and make sure that the query yields results, and that those results contain the fields you care about (in this case, auditd.log.a0 and host.name).

I know that you tried it in Kibana, but I prefer to do it via the _search API and use the exact query that the datafeed uses, to eliminate variables.
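If it helps, one way to check both conditions at once (my own sketch, not the datafeed's actual query) is to add exists filters for those two fields:

GET filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.action": { "value": "EXECVE" } } },
        { "exists": { "field": "auditd.log.a0" } },
        { "exists": { "field": "host.name" } }
      ]
    }
  }
}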


Out of curiosity, if you've re-installed the stack from scratch, why aren't you using v7.11?

Hi @richcollier! Thank you very much for your explanations! You nicely laid out the necessary steps to debug the problem. After running the GET command inside the Dev Tools section of Kibana, I got the following result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

You were right that something may be strange here: in Kibana I got results, but here I get none. :clap: So it seems there are indeed some problems.

Yes, a very good question. I used ELK v7.6.2 some months ago when I tried to detect DNS Data Exfiltration, but in a few days I will have a presentation at my university where I will show how anomalies can be detected using ELK. Given that there remained just a few days until the presentation, I preferred to use something I already knew and that had worked, rather than something new which could have required some time to get accustomed to.