5 of 30 shards failed in Elastic 5.5.1

My project is hitting a '5 of 30 shards failed' error when searching from Kibana. This is very similar to these postings:

Inspecting the browser requests, we do find:

"reason":{
"type":"illegal_argument_exception",
"reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [ts] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}

We have confirmed that multiple components of the system send logs where the 'ts' field is sometimes a date and sometimes text. We would expect that to result in the ts field not being searchable, which would be 'good', but because it triggers this "shards failed" error, the Kibana queries just fail and return no logs. It seems that ES is somehow using different mappings on different shards, exactly as described in the above links... but that should have been fixed in ES v2.x (the 2.x pull request was even linked in the second one) and we're on v5.5.1.

The problem is not consistent -- we have one system where we can see that the 'ts' value is not always the same type, and there the Kibana UI shows an "!" icon which expands to say:

Field Type Conflict: The type of this field changes across indices. It is unavailable for many analysis functions. The indices per type are as follows:
Field Type Index Names
text logstash-2018.12.27
float logstash-2018.12.28

This would be 'fine'/'good'.

But on another system this does not happen. Instead the Kibana UI just says it's a 'date' type (no "!" icon), but we get "shards failed" errors and cannot see the logs at all as a result ('bad').

How can we investigate the root cause further and find a solution?

Paul C

The root cause is that you have different mappings for fields with the same name. This can happen if the field is dynamically mapped and the data found in it varies greatly. In that case, the first log message of the day that contains this field dictates the type of the field stored in that day's index. For example, if the field contains "foobar" it will be mapped as text, but if the first log of the day contains 42.0 there, it will be mapped as float.

The simplest way to fix it is by mapping the field explicitly in the template.

Thanks for the response Igor_Motov. The problem isn't so much that the field could be mapped as float or text depending on which log message comes in first. That's what we're seeing in the system that says "Field Type Conflict" -- that is acceptable, and is not a problem. The problem is the other system where we actually get a "shards failed" error preventing us from querying any of the logs. Are you saying that "shards failed" is also normal/expected for elasticsearch when dynamic mapping is used and the data is not perfectly consistent?
Our scenario does call for using dynamic mapping. It might also be worth pointing out we are using the logstash json filter plugin to dynamically parse the fields.

I interpreted the above linked discussions (also this one) to mean that this inconsistent mapping, where one shard maps with one type and another shard maps with a different type, within the same index, should not happen since v2.x. E.g.:

On 2.0, this issue will be fixed as dynamic mappings will have to be validated on the master node first before being applied

and

Primary one thinks it's a String.
Primary two thinks it's a Date.
And you have kind of inconsistent mapping here.
That happened in the past. This is no longer true with recent versions.

It seems to be describing what we're seeing with v5.5.1 however, and it's not clear why (or how we can fix it so it consistently behaves like the 'good' system with the 'Field Type Conflict')

Can you reproduce it while searching a single index?

Yes, on the 'bad' system, when we search for logs in the last 15 minutes (just searching in the index for the current day) we get the "5 of 30 shards failed" error and cannot retrieve logs.

Does your today's index contain 30 shards?

I'm not 100% sure (I'm asking my team-mate who has access to this system with the problem). But when we adjust the time range so that it only queries within the time range for December 25th for example, logs are returned. When the time range includes any of the daily indices after the 25th then the "X of Y shards failed" messages are seen and no logs returned.

Did some more looking today and now the error says "5 of 40 shards failed" -- we don't think the index has 30 or 40 shards, each daily index has 5 shards. When we first saw the problem (dec 26) it had been running 4 days and said "5 of 20". Four days after that (dec 30) it's now saying "5 of 40", so the second number is increasing by 5 each day, for each new daily index.

When we set the Kibana Ui to show logs for the last 30 days, and a filter specifying the _index is 'logstash-2018-12-26' it says '5 of 40 shards failed'. When we do the same for the 27th and 28th, the same. When we do it for the 29th or 30th indexes however the queries can succeed. So somehow those three seem affected, yet it always says 5 of N shards failed. This might just be due to how much data is paged at a time -- each query that fails, even for 30 days of history, is just loading and hitting errors in one index before stopping (just a guess).
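One way to see which indices are actually failing, rather than guessing from the Kibana banner, is to look at the `_shards.failures` list in the raw search response captured from the browser. Below is a minimal Python sketch, assuming the response has the usual `_shards` section; the sample data is hypothetical and only shaped like the error quoted at the top of the thread. The list may not contain one entry per failed shard, so the sketch collects the failing index names and reason types rather than counting.

```python
def failing_indices(search_response):
    """Map each index named in _shards.failures to the set of failure types seen."""
    failures = search_response.get("_shards", {}).get("failures", [])
    result = {}
    for failure in failures:
        index = failure.get("index", "<unknown>")
        reason_type = failure.get("reason", {}).get("type", "<unknown>")
        result.setdefault(index, set()).add(reason_type)
    return result


# Hypothetical response fragment, for illustration only.
sample_response = {
    "_shards": {
        "total": 40,
        "successful": 35,
        "failed": 5,
        "failures": [
            {
                "shard": 0,
                "index": "logstash-2018.12.27",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "Fielddata is disabled on text fields by default.",
                },
            }
        ],
    }
}

for index, reason_types in failing_indices(sample_response).items():
    print(index, sorted(reason_types))
```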

We would like to identify how this problem occurs, since it still seems to us to be a regression of that old v1 issue somehow. But we'd also like to find a workaround -- we can see that the logs are flowing into the system, we just can't query them out via Kibana successfully when this occurs. Is there a way to tell elasticsearch to alter an index already showing the problem to explicitly treat a problematic field as text? Or a way to query that tells elasticsearch to ignore the ts field and not even attempt to use it?

We would like to reinstall this system and see if the problem happens again... but as it seems it does not always happen we're hesitant to do so as we may lose ability to debug. Would like at least a workaround in case we see the problem again in production.

The most likely scenario is what I described in my first reply 3 days ago. You can check the mapping for this field in daily indices and see how it changes over time to validate this theory.
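A minimal Python sketch of that check, assuming you have saved the output of the get-field-mapping API (`GET logstash-*/_mapping/field/ts`) as JSON. The response shape below is what I would expect from 5.x, and the document type name `logs` is hypothetical; adjust both to what your cluster actually returns.

```python
def field_types_by_index(field_mapping_response, field):
    """Return {index_name: mapped_type} for one field across all indices."""
    types = {}
    for index, body in field_mapping_response.items():
        for doc_type, fields in body.get("mappings", {}).items():
            entry = fields.get(field)
            if not entry or not entry.get("mapping"):
                continue  # field is not mapped in this index/type
            # "mapping" holds a single item keyed by the leaf field name.
            leaf_mapping = next(iter(entry["mapping"].values()))
            types[index] = leaf_mapping.get("type")
    return types


def has_conflict(types_by_index):
    """True when the field is mapped to more than one type across indices."""
    return len(set(types_by_index.values())) > 1


# Hypothetical sample using the two indices from the "Field Type Conflict" message.
sample_mappings = {
    "logstash-2018.12.27": {
        "mappings": {"logs": {"ts": {"full_name": "ts", "mapping": {"ts": {"type": "text"}}}}}
    },
    "logstash-2018.12.28": {
        "mappings": {"logs": {"ts": {"full_name": "ts", "mapping": {"ts": {"type": "float"}}}}}
    },
}

types = field_types_by_index(sample_mappings, "ts")
for index in sorted(types):
    print(index, types[index])
print("conflict:", has_conflict(types))
```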

No, you cannot change the type of a field in an existing index. Once this has happened, you need to reindex the problematic index into a new one whose mapping you create manually with the correct settings. In the future you should update the template to make sure that this field is always recognized as text.

Reinstalling is completely unnecessary. The problem is not in the software, but in the mapping. Recreating the index with the correct mapping, or just deleting the problematic indices if you don't need the data in them, should fix the issue.

The solution for this issue is fixing the template.
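As a rough illustration of what fixing the template and reindexing could look like, here is a Python sketch that only builds the two request bodies and prints them. It assumes the 5.x index template syntax (a `template` pattern key and a `_default_` mapping); the template name and the destination index name are hypothetical, and none of this has been run against your cluster.

```python
import json


def text_field_template(index_pattern, field):
    """Body for PUT _template/<name>: always map `field` as text (5.x syntax assumed)."""
    return {
        "template": index_pattern,
        "mappings": {
            "_default_": {
                "properties": {field: {"type": "text"}}
            }
        },
    }


def reindex_body(source_index, dest_index):
    """Body for POST _reindex: copy documents from one index into another."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# Hypothetical names, for illustration only.
template = text_field_template("logstash-*", "ts")
reindex = reindex_body("logstash-2018.12.27", "logstash-2018.12.27-fixed")

print("PUT _template/ts_as_text")
print(json.dumps(template, indent=2))
print("POST _reindex")
print(json.dumps(reindex, indent=2))
```

The template has to be in place before the destination index is created, since templates are only applied at index creation time.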

The reinstall is actually to allow testing another scenario. We wanted to see if the problem occurs again, but we're not expecting the reinstall to be necessary to avoid or trigger this problem.

In our scenario, we need to use dynamic mapping; we do not have a set template -- in fact we don't even know what fields to expect in the logs beforehand, let alone their types. So I don't think 'fixing the template' is an option for us... or else I'm not understanding the suggestion.

We did reinstall and did hit the problem again. But we were able to pound in more investigation and appear to have found a workaround:

  • When "X of Y shards failed" message is seen, select "Management" from the Kibana UI
  • Select "index patterns"
  • Click 'refresh' in the top right

Before clicking refresh, the problematic field ('ts' for us) had a type like 'date'. After the refresh the problematic field is noted with a '!' icon and that it has a type conflict. Once Kibana recognizes that there is a conflict, it is able to query successfully without hitting the "X of Y shards failed" error presumably because it does not try to query using the problematic field. As I've noted before, we're fine with potential type conflicts and not searching on those. We don't know all the field and types beforehand.

Thus from our perspective the problem appears to be that Kibana doesn't automatically detect that there's a conflict, and a user must manually go and click the 'refresh' button. Shouldn't Kibana be able to detect this automatically somehow? In our dev system this seems to have happened somehow -- maybe because the conflict was already there before the kibana UI was loaded the first time, while perhaps the test environment (with the problem) loaded the kibana UI before the type conflict was introduced. Obviously lots of guesswork going on here, very much unclear why manually refreshing is necessary.

It is a known elasticsearch limitation that with dynamic mapping the first record of the index determines the type of this field. It is also not possible to run an aggregation on multiple fields with the same name but different types. Kibana is aware of this problem but it cannot solve it.

There are really only 3 solutions for these:

  1. Add all fields that can contain ambiguous data to the template with the correct mapping. That doesn't seem to be an option for you since you don't know which fields you are going to have.
  2. Map all fields as strings. That will prevent you from using range queries on numeric fields or date queries on date fields, but might be a solution if most of your fields can be treated as strings.
  3. Add some preprocessor that will modify the field names based on their true types (if you know such types, or can deduce them). For example, if you know that a particular field ts should contain a date, rename it to ts_dt; if you know that a field val should contain an integer, rename it to val_int, and so on. Then use dynamic_templates to map them correctly based on the suffix.
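Option 3 could be sketched roughly as follows in Python. This is only an illustration: the rule for deducing a type from a value is an assumption, the `_float`, `_bool` and `_str` suffixes are hypothetical additions next to the `_dt` and `_int` examples above, and only flat events with scalar values are handled. The `dynamic_templates` list uses the `match` name pattern to pick the mapping from the suffix.

```python
import re

# Assumed date format: ISO-8601-like strings such as 2018-12-27T10:00:00Z.
ISO_DATE = re.compile(
    r"^\d{4}-\d{2}-\d{2}([T ]\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?)?$"
)


def suffix_for(value):
    """Pick a field-name suffix from the value's deduced type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "_bool"
    if isinstance(value, int):
        return "_int"
    if isinstance(value, float):
        return "_float"
    if isinstance(value, str):
        return "_dt" if ISO_DATE.match(value) else "_str"
    return ""  # leave None, lists and nested objects untouched


def rename_fields(event):
    """Return a copy of a flat event with type suffixes appended to field names."""
    return {name + suffix_for(value): value for name, value in event.items()}


# Goes under "dynamic_templates" in the mapping of the index template.
DYNAMIC_TEMPLATES = [
    {"dates": {"match": "*_dt", "mapping": {"type": "date"}}},
    {"integers": {"match": "*_int", "mapping": {"type": "long"}}},
    {"floats": {"match": "*_float", "mapping": {"type": "double"}}},
    {"booleans": {"match": "*_bool", "mapping": {"type": "boolean"}}},
    {"strings": {"match": "*_str", "mapping": {"type": "text"}}},
]

print(rename_fields({"ts": "2018-12-27T10:00:00Z", "val": 42}))
print(rename_fields({"ts": "foobar", "val": 42.0}))
```

With this in place, a date-valued ts and a text-valued ts end up in different fields (ts_dt and ts_str), so they can no longer fight over one mapping.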

Thanks Igor for the quick reply. I agree that Kibana can't solve the problem of type conflicts. But shouldn't it be able to detect and tolerate it automatically? If our understanding is correct (it seems to be), then the "bug" seems to be that Kibana requires a user to manually find and click the 'refresh' button to detect that there is a conflict (sometimes). Kibana just putting up that "X of Y shards failed" error and failing the query seems unnecessarily painful to understand and deal with.

Sorry, I understand now. You created the topic in the Elasticsearch category, so I thought you were interested in the elasticsearch perspective and got carried away with explanations :)

As far as kibana is concerned, I think it is already fixed in 5.5.2 and above.

THANK YOU! We weren't able to identify where the root cause was originally -- it looked to us like ES was doing something wrong with the indexes/shards. I can't quite follow what the 'fix' in Kibana actually does, but it does seem to be our problem (supported by the fact that manually refreshing the fields list in Kibana seems to work around the problem as well). And it is getting fixed in 5.5.2 and we're on 5.5.1, of course.
