Shards failed on a specific index after reducing the number of primary shards

I have a 4-node ES cluster running Elasticsearch 6.8.4. We recently had some issues with ELK restarting: with too many shards on the nodes, they went crazy every time it restarted (which happens pretty often, since our platform is quite unstable in general).

So I reduced the number of primary shards on some light indices, which were supposed to have only 1 primary shard but had 5 because they weren't configured properly initially. The ES cluster is now quite stable.

But the problem is that since then, neither Kibana nor the Python API can query this index, even though it seems to be filled. The index status is green and nothing seems to be wrong; the index logs are as usual, and there is data in it.

The change was made on 17/01/2022, and this is what we have been seeing since then.

I assume it has to do with this change, but I don't understand why reducing the number of primary shards on newly created indices would result in an index that can't be queried anymore. Can it be fixed, or do I have to revert newly created indices back to 5 primary shards?

Thanks in advance for your help, and sorry for my bad English.

Welcome to our community! :smiley:

How did you do this?

What do your Elasticsearch logs show?
What is the output from the _cluster/stats?pretty&human API?

First, thank you very much for your answer; I'll try to answer as best I can.

I did this for multiple indices: I added this mapping and passed it as the body via the Python ES library:

MAPPING_DATE_STATS = {
    'settings': {
        'index': {
            'analysis': {
                'analyzer': {
                    'default': {
                        'tokenizer': 'standard',
                        'filter': ['standard', 'asciifolding']
                    }
                }
            }
        },
        'number_of_shards': 1
    },
}
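
Roughly, the creation call in the cron job looks like this (a simplified sketch; the host and index name pattern match the logs further down, the rest is illustrative):

from datetime import date
from elasticsearch import Elasticsearch

# Simplified sketch of the daily index creation done by the cron job.
es = Elasticsearch(['http://localhost:9200'])

index_name = 'hpc.metrics.gpfs.quota.fs-{}'.format(date.today().isoformat())
es.indices.create(index=index_name, body=MAPPING_DATE_STATS)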

So newly created indices now have 1 primary shard instead of 5 (nothing was passed before).

Well, we have plenty of Elasticsearch logs, so maybe I don't know where to look, but here is the log output from the cron job that's linked to my problem:

2022-01-24 04:30:02,439 - root - INFO - Starting indexation
2022-01-24 04:31:55,013 - root - INFO - Creation index: hpc.metrics.gpfs.quota.fs-2022-01-24
2022-01-24 04:31:55,019 - elasticsearch - INFO - GET http://localhost:9200/ [status:200 request:0.005s]
/usr/local/lib/python3.6/site-packages/elasticsearch/connection/base.py:200: ElasticsearchWarning: In a future major version, this request will fail because this action would add [2] total shards, but this cluster currently has [20786]/[4000] maximum shards open. Before upgrading, reduce the number of shards in your cluster or adjust the cluster setting [cluster.max_shards_per_node].
  warnings.warn(message, category=ElasticsearchWarning)
2022-01-24 04:32:07,497 - elasticsearch - INFO - PUT http://localhost:9200/hpc.metrics.gpfs.quota.fs-2022-01-24 [status:200 request:12.477s]
2022-01-24 04:32:07,509 - elasticsearch - INFO - GET http://localhost:9200/ [status:200 request:0.002s]
/usr/local/lib/python3.6/site-packages/elasticsearch/connection/base.py:200: ElasticsearchWarning: The [standard] token filter is deprecated and will be removed in a future version.
  warnings.warn(message, category=ElasticsearchWarning)
2022-01-24 04:32:13,754 - elasticsearch - INFO - POST http://localhost:9200/_bulk [status:200 request:6.245s]
2022-01-24 04:32:13,814 - root - INFO - 184 records indexed

There are some warnings, but they have been there for as long as I've been in charge of this. While searching for relevant logs to answer you, I found this, which seems to be linked to my problem:

[2022-01-25T09:37:09,957][DEBUG][o.e.a.s.TransportSearchAction] [node237] [hpc.metrics.gpfs.quota.fs-2022-01-24][0], node[XPM-ihZNREaF_ANtNtBJjQ], [R], s[STARTED], a[id=l20xcOenSP2os3TFDgZT-g]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[hpc.metrics.gpfs.quota.fs-*], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='1643103427223', requestCache=null, scroll=null, maxConcurrentShardRequests=20, batchedReduceSize=512, preFilterShardSize=21, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, source={"size":0,"timeout":"30000ms","query":{"bool":{"must":[{"query_string":{"query":"*","default_field":"*","fields":[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"analyze_wildcard":true,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},{"query_string":{"query":"*","default_field":"*","fields":[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"analyze_wildcard":true,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},{"range":{"timestamp":{"from":1640511429460,"to":1643103429460,"include_lower":true,"include_upper":true,"format":"epoch_millis","boost":1.0}}}],"must_not":[{"match_phrase":{"FileSet_Name":{"query":"adm_pbs","slop":0,"zero_terms_query":"NONE","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":[],"excludes":[]},"stored_fields":"*","docvalue_fields":[{"field":"@timestamp","format":"date_time"},{"field":"timestamp","format":"date_time"}],"script_fields":{},"aggregations":{"2":{"date_histogram":{"field":"timestamp","time_zone":"UTC","interval":"1d","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":1},"aggregations":{"3":{"terms":{"field":"FileSet_Name","size":400,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"1":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"FS_Size"}},"4":{"range":{"field":"FS_Size","ranges":[{"from":0.0,"to":1.099511627776E12}],"keyed":true},"aggregations":{"1":{"sum":{"field":"FS_Size"}}}}}}}}}}}] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [node237][10.120.40.237:9300][indices:data/read/search[phase/query]]
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. Set fielddata=true on [FileSet_Name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
        at org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:779) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:116) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.index.query.QueryShardContext.getForField(QueryShardContext.java:177) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceConfig.resolve(ValuesSourceConfig.java:95) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.resolveConfig(ValuesSourceAggregationBuilder.java:317) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.doBuild(ValuesSourceAggregationBuilder.java:310) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.doBuild(ValuesSourceAggregationBuilder.java:37) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.AbstractAggregationBuilder.build(AbstractAggregationBuilder.java:139) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.AggregatorFactories$Builder.build(AggregatorFactories.java:335) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.AggregatorFactory.<init>(AggregatorFactory.java:187) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.<init>(ValuesSourceAggregatorFactory.java:40) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregatorFactory.<init>(DateHistogramAggregatorFactory.java:54) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregationBuilder.innerBuild(DateHistogramAggregationBuilder.java:442) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.doBuild(ValuesSourceAggregationBuilder.java:311) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.support.ValuesSourceAggregationBuilder.doBuild(ValuesSourceAggregationBuilder.java:37) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.AbstractAggregationBuilder.build(AbstractAggregationBuilder.java:139) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.aggregations.AggregatorFactories$Builder.build(AggregatorFactories.java:335) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService.parseSource(SearchService.java:833) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService.createContext(SearchService.java:637) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:596) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService.access$100(SearchService.java:126) ~[elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359) [elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355) [elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1107) [elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.8.4.jar:6.8.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.4.jar:6.8.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

I don't quite understand the problem though.

And for the output of _cluster/stats, here's what I have:

{
  "_nodes" : {
    "total" : 4,
    "successful" : 4,
    "failed" : 0
  },
  "cluster_name" : "hpc-es-cluster",
  "cluster_uuid" : "PJ9rr9S0Q_yF1BH12WyOUQ",
  "timestamp" : 1643186784383,
  "status" : "yellow",
  "indices" : {
    "count" : 7125,
    "shards" : {
      "total" : 19788,
      "primaries" : 10409,
      "replication" : 0.9010471707176482,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 2.777263157894737
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.4609122807017545
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.8929403508771946
        }
      }
    },
    "docs" : {
      "count" : 2151914374,
      "deleted" : 1234388
    },
    "store" : {
      "size" : "1.3tb",
      "size_in_bytes" : 1461776372046
    },
    "fielddata" : {
      "memory_size" : "55.8mb",
      "memory_size_in_bytes" : 58544184,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "150.1mb",
      "memory_size_in_bytes" : 157481766,
      "total_count" : 25719576,
      "hit_count" : 13244924,
      "miss_count" : 12474652,
      "cache_size" : 156033,
      "cache_count" : 410476,
      "evictions" : 254443
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 117018,
      "memory" : "3gb",
      "memory_in_bytes" : 3223782096,
      "terms_memory" : "1.6gb",
      "terms_memory_in_bytes" : 1776594679,
      "stored_fields_memory" : "720.6mb",
      "stored_fields_memory_in_bytes" : 755620432,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "211.2kb",
      "norms_memory_in_bytes" : 216320,
      "points_memory" : "395.1mb",
      "points_memory_in_bytes" : 414376441,
      "doc_values_memory" : "264.1mb",
      "doc_values_memory_in_bytes" : 276974224,
      "index_writer_memory" : "13.5mb",
      "index_writer_memory_in_bytes" : 14239072,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "48mb",
      "fixed_bit_set_memory_in_bytes" : 50342136,
      "max_unsafe_auto_id_timestamp" : 1643179134545,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 4,
      "data" : 4,
      "coordinating_only" : 0,
      "master" : 4,
      "ingest" : 4
    },
    "versions" : [
      "6.8.4"
    ],
    "os" : {
      "available_processors" : 96,
      "allocated_processors" : 96,
      "names" : [
        {
          "name" : "Linux",
          "count" : 4
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 4
        }
      ],
      "mem" : {
        "total" : "499.7gb",
        "total_in_bytes" : 536634761216,
        "free" : "64.6gb",
        "free_in_bytes" : 69364830208,
        "used" : "435.1gb",
        "used_in_bytes" : 467269931008,
        "free_percent" : 13,
        "used_percent" : 87
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 7
      },
      "open_file_descriptors" : {
        "min" : 52763,
        "max" : 79157,
        "avg" : 64603
      }
    },
    "jvm" : {
      "max_uptime" : "36.7d",
      "max_uptime_in_millis" : 3178970744,
      "versions" : [
        {
          "version" : "1.8.0_262",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.262-b10",
          "vm_vendor" : "Oracle Corporation",
          "count" : 4
        }
      ],
      "mem" : {
        "heap_used" : "61.6gb",
        "heap_used_in_bytes" : 66205524200,
        "heap_max" : "123.4gb",
        "heap_max_in_bytes" : 132515889152
      },
      "threads" : 1484
    },
    "fs" : {
      "total" : "9.2tb",
      "total_in_bytes" : 10165608480768,
      "free" : "7.4tb",
      "free_in_bytes" : 8241318002688,
      "available" : "7tb",
      "available_in_bytes" : 7724837912576
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 4
      },
      "http_types" : {
        "security4" : 4
      }
    }
  }
}

Thanks in advance for your help

It turns out one of my fields, FileSet_Name, was automatically given a keyword sub-field, FileSet_Name.keyword (sorry for all those silly assumptions, I sometimes don't know where to look). Which is a bit tricky, because you don't see the changed name everywhere (not in Kibana Discover, for example).

Figured it has to do with this:

Fielddata is disabled on text fields by default. Set fielddata=true on [FileSet_Name] in order to load fielddata in memory by uninverting the inverted index.

Changed my mapping of this index a bit, and everything's in place now.
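
For anyone hitting the same error, here is roughly the kind of mapping that avoids it (a sketch only; the mapping type name and the other fields are assumptions based on the query in the logs above): declare FileSet_Name as text with a keyword sub-field and aggregate on FileSet_Name.keyword, rather than setting fielddata=true on the text field.

# Sketch of a mapping that keeps FileSet_Name searchable as full text while
# exposing FileSet_Name.keyword for aggregations. The mapping type name
# 'doc' and the other fields are assumptions, not my exact mapping.
MAPPING_DATE_STATS = {
    'settings': {
        'number_of_shards': 1
    },
    'mappings': {
        'doc': {
            'properties': {
                'FileSet_Name': {
                    'type': 'text',
                    'fields': {
                        'keyword': {'type': 'keyword', 'ignore_above': 256}
                    }
                },
                'FS_Size': {'type': 'long'},
                'timestamp': {'type': 'date'}
            }
        }
    }
}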

Sorry for the silly post, since I solved my problem just by thinking it through a bit.

Great! However you have a much larger issue here that you need to address pretty urgently.

Your shard size is 0.065GB on average. That's way too small and you need to look at fixing it. The fix will depend on what sort of data you are storing (e.g. time based, or other).
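
That number comes straight from the stats you posted: roughly 1.3TB of store spread over 19788 shards, i.e. about 1300GB / 19788 ≈ 0.065GB per shard, whereas something in the tens of gigabytes per shard is a far more common target.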

Also please upgrade, 8.0 is just around the corner and 6.X was released a few years ago now.

Thanks a lot for your answer.

I'm open to suggestions, but for some context: this ES cluster was deployed without any of this in mind, and I'm just trying to keep it alive. Since it's monitoring a very large compute cluster, I won't be able to get rid of past data. There are several indices monitoring disk usage, jobs, or node usage, and one index per metric is created every day and filled with that day's data.

Our ES cluster has been really unstable recently, and the only solution we found to keep it working without losing data has been to reduce the shard count. So you're right to point this out.

If you have any suggestions on how to increase shard size (or whether reducing the shard count would be smarter in this case), I would be glad to hear them.

Also, I have suggested upgrading, but since this cluster is supposed to be replaced by an industrialized ELK stack in the near future (about a year), they don't want me to upgrade, for fear that we lose data or that everything breaks.

Thanks a lot for your answer, this really helps me understand ES a bit better.

The first step is usually to change your sharding strategy so you stop creating a lot of new small indices. Merge indices together and consider changing the time period each index covers, e.g. go from daily indices to weekly and/or monthly ones. If you have a defined retention period this will reduce the shard count over time.
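
As a rough example of what that merge could look like with the Python client you are already using (a sketch only; the source pattern is borrowed from your logs and the monthly target name is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# Sketch: fold a month of daily indices into one monthly index.
es.reindex(
    body={
        'source': {'index': 'hpc.metrics.gpfs.quota.fs-2022-01-*'},
        'dest': {'index': 'hpc.metrics.gpfs.quota.fs-2022-01'}
    },
    wait_for_completion=True,
    request_timeout=3600
)

# Once the new index has been checked, the daily indices can be removed:
# es.indices.delete(index='hpc.metrics.gpfs.quota.fs-2022-01-*')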

If you are having stability issues (quite possible, as you have far, far too many shards for a cluster that size) you may need to 1) delete data, 2) use the shrink index API to reduce the number of primary shards of existing indices, or 3) reindex old indices into larger ones and then clean up the old ones. A sketch of option 2 is below.
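
For option 2, the shrink API works roughly like this (a sketch only: the index must first be made read-only with a copy of every shard on one node, the node name below is a placeholder, and the target's primary count must be a factor of the source's, e.g. 5 -> 1):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

source = 'hpc.metrics.gpfs.quota.fs-2022-01-10'          # example 5-shard index
target = 'hpc.metrics.gpfs.quota.fs-2022-01-10-shrunk'   # made-up target name

# Prerequisites: block writes and relocate a copy of every shard
# onto a single node ('node237' is only a placeholder).
es.indices.put_settings(
    index=source,
    body={
        'index.blocks.write': True,
        'index.routing.allocation.require._name': 'node237'
    }
)

# Shrink to a single primary shard and clear the temporary settings
# on the target index.
es.indices.shrink(
    index=source,
    target=target,
    body={
        'settings': {
            'index.number_of_shards': 1,
            'index.routing.allocation.require._name': None,
            'index.blocks.write': None
        }
    }
)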
