3/6 shards failing when trying to visualize on Kibana

Hi, I'm currently trying to do a simple POC for a potential monitoring system for my company using the Elastic stack. I have metricbeat & winlogbeat going through Logstash, which ships to Elasticsearch. I'm currently only using a single node and would like to avoid clustering until we have a better idea of whether we're going to use the Elastic stack for our monitoring system. Right now, I have 4 of our servers set up for monitoring and it looks like data is being collected properly from each. However, when I try to graph the data, e.g. the number of hosts, it tells me 3/6 shards have failed and it shows 0 hosts. I can run queries in the Discover section, but am unable to graph certain data. I'm using the latest version of everything (7.2.0). Any help solving this problem is appreciated!

Additional Info:

Nodes: 1
Disk Available: 34.47% (20.0 GB / 58.0 GB)
JVM Heap: 43.04% (1.7 GB / 4.0 GB)
Indices: 511
Documents: 17,510,180
Disk Usage: 6.7 GB
Primary Shards: 511
Replica Shards: 0

Hi @sarahvo,

a common pitfall when ingesting data from metricbeat (and any other beat like winlogbeat or filebeat) through logstash is that the index templates are not installed automatically. When ingesting directly from metricbeat to Elasticsearch, the beat would take care of that. Without it, the ingested data might have a dynamic mapping that can prevent queries from working correctly.

Could you double-check that you have the correct metricbeat and winlogbeat index templates installed?

Sorry, how do I check that the correct index templates are installed?

You can use the Kibana dev tools to send custom HTTP requests to Elasticsearch. The metricbeat and winlogbeat index templates would be listed in the response to the request

GET _template/metricbeat-*,winlogbeat-*

If these are present and if the index_patterns pattern matches the index names, they will make sure that the auto-created indices will have the correct mapping (i.e. the correct field types).

You can check for the mappings by executing queries like

GET metricbeat-*/_mapping

and

GET winlogbeat-*/_mapping

If you posted the results of these queries here, I could take a look at them. Make sure the responses don't contain any sensitive data, though.

Thanks @weltenwort !

Here's a snippet of what I get when I run: GET _template/metricbeat-*,winlogbeat-*

& GET metricbeat-*/_mapping

That looks good so far. What I can't see from the screenshots is whether all metricbeat-* indices (and all winlogbeat-* indices) have the same mapping. Could you either check that or paste the full text of the responses?

Since the full text responses were thousands of lines long, I double-checked the mappings instead and everything seems to look good. What else could be the issue?

Ok, assuming the mappings are all compatible, let's look at the shard states first using

GET _cat/shards

A second direction of investigation would be to look at the specific queries that fail. Could you give an example of a query that fails in the way you described? To find the query...

  • click "Inspect" in the visualization editor
  • select "Requests" in the popover menu in the flyout
  • copy the "Request" and "Response" JSON values
  • paste them into a reply here as "fenced" text
    ```json
    { JSON here }
    ```

At the same time it would be interesting to see whether the Elasticsearch process itself logs any error message to its console or log file.

Here's the response I received for

GET _cat/shards

When I try to query the number of hosts, I receive the error that 1/2 shards have failed.
Here's the Request JSON value I received:

{
  "aggs": {
    "1": {
      "cardinality": {
        "field": "host.name"
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    },
    {
      "field": "ceph.monitor_health.last_updated",
      "format": "date_time"
    },
    {
      "field": "docker.container.created",
      "format": "date_time"
    },
    {
      "field": "docker.healthcheck.event.end_date",
      "format": "date_time"
    },
    {
      "field": "docker.healthcheck.event.start_date",
      "format": "date_time"
    },
    {
      "field": "docker.image.created",
      "format": "date_time"
    },
    {
      "field": "event.created",
      "format": "date_time"
    },
    {
      "field": "event.end",
      "format": "date_time"
    },
    {
      "field": "event.start",
      "format": "date_time"
    },
    {
      "field": "file.ctime",
      "format": "date_time"
    },
    {
      "field": "file.mtime",
      "format": "date_time"
    },
    {
      "field": "kubernetes.container.start_time",
      "format": "date_time"
    },
    {
      "field": "kubernetes.event.metadata.timestamp.created",
      "format": "date_time"
    },
    {
      "field": "kubernetes.event.timestamp.first_occurrence",
      "format": "date_time"
    },
    {
      "field": "kubernetes.event.timestamp.last_occurrence",
      "format": "date_time"
    },
    {
      "field": "kubernetes.node.start_time",
      "format": "date_time"
    },
    {
      "field": "kubernetes.pod.start_time",
      "format": "date_time"
    },
    {
      "field": "kubernetes.system.start_time",
      "format": "date_time"
    },
    {
      "field": "mongodb.replstatus.server_date",
      "format": "date_time"
    },
    {
      "field": "mongodb.status.background_flushing.last_finished",
      "format": "date_time"
    },
    {
      "field": "mongodb.status.local_time",
      "format": "date_time"
    },
    {
      "field": "mssql.transaction_log.stats.backup_time",
      "format": "date_time"
    },
    {
      "field": "nats.server.time",
      "format": "date_time"
    },
    {
      "field": "php_fpm.pool.start_time",
      "format": "date_time"
    },
    {
      "field": "php_fpm.process.start_time",
      "format": "date_time"
    },
    {
      "field": "postgresql.activity.backend_start",
      "format": "date_time"
    },
    {
      "field": "postgresql.activity.query_start",
      "format": "date_time"
    },
    {
      "field": "postgresql.activity.state_change",
      "format": "date_time"
    },
    {
      "field": "postgresql.activity.transaction_start",
      "format": "date_time"
    },
    {
      "field": "postgresql.bgwriter.stats_reset",
      "format": "date_time"
    },
    {
      "field": "postgresql.database.stats_reset",
      "format": "date_time"
    },
    {
      "field": "process.start",
      "format": "date_time"
    },
    {
      "field": "system.process.cpu.start_time",
      "format": "date_time"
    },
    {
      "field": "zookeeper.server.version_date",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-07-24T18:22:21.066Z",
              "lte": "2019-07-25T18:22:21.066Z"
            }
          }
        }
      ],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

and the Response JSON value:

 {
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 1,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 0,
        "index": "metricbeat-2019.07.24",
        "node": "MamkBKhkS3iZpspN5i-j_w",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [host.name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "1": {
      "value": 0
    }
  },
  "status": 200
}

Here are also some recent logs I pulled from elasticsearch:

"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [e5fcad9762b9][192.168.48.2:9300][indices:data/read/search[phase/query]]",
"Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. Set fielddata=true on [host.name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
"at org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:711) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:116) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.index.query.QueryShardContext.getForField(QueryShardContext.java:179) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.aggregations.support.ValuesSourceConfig.resolve(ValuesSourceConfig.java:95) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.resolveConfig(ValuesSourceAggregationBuilder.java:321) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.doBuild(ValuesSourceAggregationBuilder.java:314) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder$LeafOnly.doBuild(ValuesSourceAggregationBuilder.java:42) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.aggregations.AbstractAggregationBuilder.build(AbstractAggregationBuilder.java:139) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.aggregations.AggregatorFactories$Builder.build(AggregatorFactories.java:332) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService.parseSource(SearchService.java:789) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService.createContext(SearchService.java:591) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:550) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:353) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService$2.doRun(SearchService.java:1052) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }

So sorry for the information overload. I really appreciate your help on this! Thank you!


I asked for it, it's really helpful for tracking down the issue :wink: Thank you for providing these details.

The error message indicates that at least the index metricbeat-2019.07.24 has an unsuitable mapping for the host.name field (and others, I would guess). So I would like to return to investigating the mappings.
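If you'd rather not read through the response JSON by eye, here is a minimal Python sketch that pulls the failing indices out of a search response like the one posted above. The function name `failed_indices` is made up for illustration, and the sample is a trimmed copy of the posted response with the long reason text shortened.

```python
from typing import Any, Dict


def failed_indices(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each index that reported a shard failure to the failure type."""
    failures = response.get("_shards", {}).get("failures", [])
    return {f["index"]: f["reason"]["type"] for f in failures}


# Trimmed copy of the response from the thread (reason text shortened).
response = {
    "_shards": {
        "total": 2,
        "successful": 1,
        "skipped": 0,
        "failed": 1,
        "failures": [
            {
                "shard": 0,
                "index": "metricbeat-2019.07.24",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "Fielddata is disabled on text fields by default.",
                },
            }
        ],
    }
}

print(failed_indices(response))
```

Only the index without the version number shows up, which is why the mapping of that specific index is the next thing to look at.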

There seem to be two metricbeat indices (and two winlogbeat indices as well), one with a version number and one without. Since the query apparently targets metricbeat-*, it queries both. If one of them has an incorrect mapping, the shards that belong to that index would cause the query to fail.

Let's look at the mapping of the host.name field specifically across all metricbeat-* indices:

GET /metricbeat-*/_mapping/field/host.name

After this we will probably have to look at the metricbeat configuration to find out which of these indices is the correct one and where the other comes from.

Here's the mapping of the host.name field:

{
  "metricbeat-7.2.0-2019.07.24-000001" : {
    "mappings" : {
      "host.name" : {
        "full_name" : "host.name",
        "mapping" : {
          "name" : {
            "type" : "keyword",
            "ignore_above" : 1024
          }
        }
      }
    }
  },
  "metricbeat-2019.07.24" : {
    "mappings" : {
      "host.name" : {
        "full_name" : "host.name",
        "mapping" : {
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

In the output section of my beats.conf I created for logstash, I do have:
index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
as seen below (with the hosts, user, and pw erased):

 output {
	elasticsearch {
		hosts => ""
		index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
		document_type => "%{[@metadata][type]}"
		manage_template => false
		user => 
		password => 
	}
}

could that be part of the issue?

Here's part of my metricbeat.yml too:

 metricbeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

#==================== Elasticsearch template setting ==========================

setup.template.settings:
  index.number_of_shards: 1
  index.codec: best_compression
  #_source.enabled: false

Yes, it seems the names of the indices written to by logstash do not match the pattern of the metricbeat index template, which is metricbeat-7.2.0-*. That means the template is never applied to the indices created in response to logstash's write operations, so they fall back to a dynamic mapping instead of the correct one. A quick fix would be to adjust the index name in the Elasticsearch output in logstash to mimic the pattern normally used by metricbeat (i.e. include the 7.2.0 version number). Alternatively, the index template's pattern could be adjusted.

In both cases, extra care needs to be taken when updating metricbeat to ensure the new index templates are used. Writing to Elasticsearch directly from metricbeat would eliminate that step, because metricbeat would automatically install the new mappings when updated.
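To illustrate the mismatch, here is a small Python sketch that checks the index names from this thread against the template pattern. It uses `fnmatchcase` as a rough stand-in for the wildcard matching Elasticsearch applies to `index_patterns` (an approximation, since `fnmatch` also treats `?` and `[]` as special characters). The helper name `template_applies` is hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern of the metricbeat index template, as mentioned in the thread.
TEMPLATE_PATTERN = "metricbeat-7.2.0-*"


def template_applies(index_name: str, pattern: str = TEMPLATE_PATTERN) -> bool:
    """Return True if the index name falls under the template's pattern."""
    return fnmatchcase(index_name, pattern)


index_names = [
    "metricbeat-2019.07.24",               # written by logstash (no version)
    "metricbeat-7.2.0-2019.07.24",         # logstash name with version added
    "metricbeat-7.2.0-2019.07.24-000001",  # existing index with version
]

for name in index_names:
    print(name, template_applies(name))
```

Only the name without the version number falls outside the pattern, which matches the index that ended up with host.name mapped as text.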


Thank you!! I fixed the name of the indices of the Elasticsearch output in logstash to include the version number and I no longer have issues with my shards failing.

That's good to hear :+1:

As I wrote above, without metricbeat shipping data directly to Elasticsearch, each metricbeat upgrade means the new index template will have to be installed in Elasticsearch and the logstash configuration updated with the new version number. You might be able to automate the latter step by reading the metricbeat version from the documents. There should be an agent.version field in each document that contains it.
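As a rough sketch of that idea (in Python rather than logstash configuration, just to show the logic), the index name could be derived from each document's agent.version and @timestamp. The function name `index_name_for` and the exact document shape are assumptions for illustration. The naming scheme mirrors the metricbeat-7.2.0-* pattern discussed above.

```python
from datetime import datetime
from typing import Any, Dict


def index_name_for(doc: Dict[str, Any], beat: str = "metricbeat") -> str:
    """Build '<beat>-<version>-<YYYY.MM.dd>' from a document's own fields."""
    version = doc["agent"]["version"]
    # Assumes timestamps look like 2019-07-24T18:22:21.066Z (UTC, with millis).
    ts = datetime.strptime(doc["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return f"{beat}-{version}-{ts:%Y.%m.%d}"


# Hypothetical, heavily trimmed document for illustration only.
doc = {"@timestamp": "2019-07-24T18:22:21.066Z", "agent": {"version": "7.2.0"}}

print(index_name_for(doc))
```

Because the version comes from the document itself, the index name would follow a metricbeat upgrade without a manual configuration change. The matching index template would still need to be installed separately.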