Cluster management: 2000+ open active shards

Thank you very much for the constant support you are providing. It definitely saved me a panic attack. The only reason I have a multi-cluster environment is that I cannot afford to lose the incoming logs. Hence I carry out updates and OS reboots one data node at a time. A helpful engineer from Elastic did mention caching logs at Logstash (which is running on a Raspberry Pi for me :slight_smile: ) but work pressure and studies were keeping me fully occupied.

I am sorry to have asked for an extended meeting. I was not aware of the community rules, and I definitely do not want to break any of them.

I've taken the steps I understood but have not merged or deleted indexes. I will wait for your availability to carry out such changes. Since the SSD is full, I am unable to take a snapshot of the VM as a backup.

Unfortunately, I do not have any backup of the data, and I don't have the storage onboard. Maybe I can attach an external HDD to the workstation, create a volume and mount it to the VM for a backup? I will look at mounting an external disk to the VM, and request your guidance for the snapshot :smiley: . That would definitely help me sleep better, knowing there is a backup of the data.

I know about snapshots but I have never been able to implement them. Hopefully this time around I will, with your guidance.

I don't mind moving one of the hosts to the NAS but I see it as a serious bottleneck for IOPS. Let me see if I can measure it before the move. Since I have separated the disks for OS and Elasticsearch data, I can easily mount a second disk, move the data to see how it performs, and revert to the first one if need be. I'll keep you posted on this one.

Combining and deleting indexes is something I'd like to do with your guidance, as I am not even "good" at Elastic administration. Hopefully I'll get certified soon. One more for college :slight_smile:

While I see a "trial" license in the stack management page, I reckon it is a platinum license since I have access to ML and Kibana Graph for link analysis.

Edit 1:
So the open shards are above 2000 and hence I cannot export reports for college. :expressionless:

Thank you very much once again, and I hope you and your loved ones are safe and healthy.

Hi @parthmaniar

You are doing a great job, and no worries about asking at all. I just wanted to set expectations.

Here are some things you can do.

You can raise the shards-per-node setting; this will allow you to do some work. This is not a long-term solution, as you still have too many shards, but it will give you some headroom to work.

PUT _cluster/settings 
{
  "persistent": { "cluster.max_shards_per_node":  1200 }
}
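
For context on why things stop at 2000: as far as I understand it, the cluster-wide cap on open shards is `cluster.max_shards_per_node` (default 1000) multiplied by the number of data nodes. Here is a rough sketch of that arithmetic, assuming two data nodes; the function names are only for illustration.

```python
def cluster_shard_limit(max_shards_per_node: int, data_nodes: int) -> int:
    """Cluster-wide cap on open shards: per-node limit times data node count."""
    return max_shards_per_node * data_nodes


def headroom(limit: int, open_shards: int) -> int:
    """Shards that can still be opened before the cap (negative means over)."""
    return limit - open_shards


data_nodes = 2  # assumption: two data nodes in the cluster
default_limit = cluster_shard_limit(1000, data_nodes)  # default setting
raised_limit = cluster_shard_limit(1200, data_nodes)   # after the PUT above
print(default_limit, raised_limit)
```

So the raised setting only buys a few hundred shards of breathing room, which is why the cleanup still matters.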

Then, you have about 125 indices with 0 documents. I would delete those; that will free up shards.
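
One way to find the empty ones is to fetch `GET _cat/indices?format=json` and filter on the `docs.count` column. Below is a small sketch of that filter. It assumes the JSON has already been downloaded and parsed, and the sample rows are made up. It only lists candidates and deletes nothing.

```python
from typing import Dict, List, Optional


def empty_indices(cat_rows: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return names of non-system indices that report zero documents.

    cat_rows is the parsed JSON from GET _cat/indices?format=json, where
    numeric columns such as docs.count arrive as strings (or None for
    closed indices).
    """
    names = []
    for row in cat_rows:
        name = row.get("index") or ""
        if name.startswith("."):
            continue  # leave system / hidden indices alone
        if row.get("docs.count") == "0":
            names.append(name)
    return sorted(names)


# Made-up sample rows, for illustration only.
sample = [
    {"index": "auditbeat-7.12.0-2021.04.01", "docs.count": "0"},
    {"index": "cowrie-000003", "docs.count": "1200345"},
    {"index": ".kibana_1", "docs.count": "0"},
    {"index": "metricbeat-closed", "docs.count": None},
]
print(empty_indices(sample))
```

Before deleting anything from the list, check that an empty index is not the current write index behind a rollover alias, otherwise ingestion for that alias will break.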

Then I would look at those 100s of tiny indices and reindex them into larger indices.
Do them by type. For example, reindex all those small auditbeat indices into 1 auditbeat index.

If you use a destination name starting with the same name, like auditbeat-7.12.0-2020-reindex, it will use the correct mapping / schema, because the existing index template matches on the name prefix.

You can do this carefully and slowly and start to free up shards. Make sure the number of documents adds up, then you can delete the old tiny indices.
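
To make the reindex-then-verify step concrete, here is a sketch that builds a body for `POST _reindex` and checks that the document counts add up afterwards. The source index names and counts are hypothetical; only the destination name comes from the example above. It assumes you read the counts yourself, for instance from `GET <index>/_count`.

```python
from typing import Dict, List


def build_reindex_body(sources: List[str], dest: str) -> Dict:
    """Request body for POST _reindex that copies several indices into one."""
    return {"source": {"index": sources}, "dest": {"index": dest}}


def counts_add_up(source_counts: Dict[str, int], dest_count: int) -> bool:
    """True when the destination holds exactly the sum of the source docs."""
    return sum(source_counts.values()) == dest_count


# Hypothetical small indices and their document counts.
source_counts = {
    "auditbeat-7.12.0-2020.11.01": 1500,
    "auditbeat-7.12.0-2020.11.02": 2300,
}
body = build_reindex_body(sorted(source_counts), "auditbeat-7.12.0-2020-reindex")
print(body)
print(counts_add_up(source_counts, 3800))
```

The counts only match exactly if no two source indices share a document ID, since reindex keeps the original `_id` and a duplicate would overwrite. Only delete the old indices once the check passes.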

And yes, even a NAS to take a snapshot / backup to would give you more peace of mind.

Do these things, then come back.

There are some other issues I can see but these will help first.

I reckon the pain from yesterday is back. I am unable to log in to the stack via Kibana. I can see the storage is at 90% (up from 77%). I can see that 100GB of logs have been created. I am removing those.
I have only used the link analysis feature today. Does this cause huge logs to be generated?

Edit - 1:
I restarted the Elasticsearch service and did a health check via Postman to see this:

{
    "cluster_name": "data-analytics-1",
    "status": "red",
    "timed_out": false,
    "number_of_nodes": 3,
    "number_of_data_nodes": 2,
    "active_primary_shards": 814,
    "active_shards": 814,
    "relocating_shards": 0,
    "initializing_shards": 2,
    "unassigned_shards": 1484,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 13,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 563,
    "active_shards_percent_as_number": 35.391304347826086
}
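
For reading a response like this at a glance, here is a small sketch that condenses the health JSON into a few numbers. It assumes the response has already been fetched and parsed. The total ignores relocating shards (zero here), and the function name is just for illustration.

```python
from typing import Dict


def summarize_health(health: Dict) -> Dict:
    """Condense a GET _cluster/health response into at-a-glance numbers."""
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    active_percent = 100.0 * health["active_shards"] / total if total else 100.0
    return {
        "status": health["status"],
        "total_shards": total,
        "active_percent": active_percent,
        "needs_attention": health["status"] != "green"
        or health["unassigned_shards"] > 0,
    }


# Values copied from the health output above.
posted = {
    "status": "red",
    "active_shards": 814,
    "initializing_shards": 2,
    "unassigned_shards": 1484,
}
summary = summarize_health(posted)
print(summary["total_shards"], round(summary["active_percent"], 2))  # prints: 2300 35.39
```

The 35.39% matches `active_shards_percent_as_number`: 814 of 2300 shards are active, and the rest were still waiting to be assigned after the restart.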

Here is the output of this command:

#! [node.master] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
#! [node.transform] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
#! [node.data] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
#! [node.remote_cluster_client] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
#! [node.ingest] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
#! [node.ml] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
{
  "acknowledged" : true,
  "persistent" : {
    "cluster" : {
      "max_shards_per_node" : "1200"
    }
  },
  "transient" : { }
}

There's a warning on the index page:

This number was much larger last I remember. I hope I haven't lost data. Let me quickly check the most important index I have. Thank you.

This index is called "cowrie". Can I make sure no document from the index is ever deleted as part of any lifecycle management? <--- Super important: 3 years of research data is in this index :smiley:

You can turn off the audit logging if you do not need it.

Set xpack.security.audit.enabled to false in elasticsearch.yml on each node.

You will need to re-run the following command every time after you flood the disk:

PUT /*/_settings
{
  "index": {
    "blocks.read_only_allow_delete": false,
    "blocks.read_only": false
  }
}
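
The read-only block is applied when a node's disk crosses the flood-stage watermark. Here is a rough sketch of where a given disk usage sits, assuming the default watermarks of 85% (low), 90% (high) and 95% (flood stage). If `cluster.routing.allocation.disk.watermark.*` has been changed on your cluster, the thresholds will differ.

```python
def disk_watermark_state(
    used_percent: float,
    low: float = 85.0,
    high: float = 90.0,
    flood_stage: float = 95.0,
) -> str:
    """Classify disk usage against the (assumed default) disk watermarks."""
    if used_percent >= flood_stage:
        return "flood_stage"  # indices on the node get read_only_allow_delete
    if used_percent >= high:
        return "high"  # Elasticsearch tries to move shards off the node
    if used_percent >= low:
        return "low"  # no new shards are allocated to the node
    return "ok"


for pct in (77, 86, 91, 96):
    print(pct, disk_watermark_state(pct))
```

Removing the block only helps once disk usage is back under the flood-stage threshold, so free up space first.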

Yes, I am sure you have ILM issues, but there is no default delete phase for indices. Unless you specifically set up a delete phase, nothing will be deleted, so that should not be a problem.
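
If you want to double-check, `GET _ilm/policy` returns every policy with its phases, and `GET cowrie*/_ilm/explain` shows which policy manages an index. Here is a minimal sketch that scans an already-parsed `_ilm/policy` response for policies containing a delete phase. The policy names in the sample are made up.

```python
from typing import Dict, List


def policies_with_delete_phase(ilm_policies: Dict) -> List[str]:
    """Names of ILM policies whose definition includes a delete phase."""
    found = []
    for name, entry in ilm_policies.items():
        phases = entry.get("policy", {}).get("phases", {})
        if "delete" in phases:
            found.append(name)
    return sorted(found)


# Made-up response in the shape GET _ilm/policy returns.
sample_policies = {
    "cowrie-policy": {
        "version": 1,
        "policy": {"phases": {"hot": {"actions": {"rollover": {"max_size": "50gb"}}}}},
    },
    "short-lived-logs": {
        "version": 3,
        "policy": {
            "phases": {
                "hot": {"actions": {}},
                "delete": {"min_age": "30d", "actions": {"delete": {}}},
            }
        },
    },
}
print(policies_with_delete_phase(sample_policies))
```

If the policy attached to the index you care about is not in that list, ILM will not delete it.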

The problem will be that your indices and shards will continue to get bigger, but the cowrie index at least seems to be rolling over, so that is good.

It looks like you reduced the shards. That is good!

Hi, I've changed the audit setting to false on all three hosts. Thank you for that.
I am restarting the service now.

Here is the output for the command you gave:

#! this request accesses system indices: [.apm-agent-configuration, .apm-custom-link, .async-search, .kibana_1, .kibana_2, .kibana_3, .kibana_4, .kibana_5, .kibana_6, .kibana_7, .kibana_7.12.0_001, .kibana_7.12.1_001, .kibana_task_manager_1, .kibana_task_manager_2, .kibana_task_manager_7.12.0_001, .kibana_task_manager_7.12.1_001, .reporting-2020-11-22, .reporting-2021-01-31, .reporting-2021-02-07, .reporting-2021-02-28, .reporting-2021-03-14, .reporting-2021-04-04, .reporting-2021-04-11, .reporting-2021-04-18, .reporting-2021-04-25, .security-7, .tasks, .transform-internal-005, .transform-internal-006], but in a future major version, direct access to system indices will be prevented by default
#! Overriding settings on system indices: [.transform-internal-*] -> [index.blocks.read_only, index.blocks.read_only_allow_delete], [.tasks*] -> [index.blocks.read_only, index.blocks.read_only_allow_delete], [.security-[0-9]+] -> [index.blocks.read_only, index.blocks.read_only_allow_delete]. This will not work in the next major version
{
  "acknowledged" : true
}

How do I work on the ILM issues?

I don't think you need to worry about ILM at this moment. I'm not even sure exactly what the issues are. Some of the index names tell me ILM is not set up correctly, but your big index seems to be working okay, so I would not focus on ILM right now.

Cleaning up and combining indexes and getting more storage is much more important in my opinion.

Taking a snapshot is very important when you can.
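
As a rough outline of what that involves once the extra disk is mounted, the sketch below builds the two requests: registering a shared-filesystem (`fs`) repository and taking a snapshot into it. The repository name, snapshot name and mount path are hypothetical. The path must also be listed under `path.repo` in elasticsearch.yml and be reachable at the same location from every master and data node.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, str, Dict]  # (HTTP method, path, JSON body)


def register_fs_repository(repo: str, location: str) -> Request:
    """Request that registers a shared-filesystem snapshot repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"/_snapshot/{repo}", body)


def create_snapshot(repo: str, snapshot: str, indices: List[str]) -> Request:
    """Request that snapshots the given indices and waits for completion."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    return ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", body)


# Hypothetical names and path, for illustration only.
repo_request = register_fs_repository("external_backup", "/mnt/es_backup")
snap_request = create_snapshot("external_backup", "cowrie-snap-1", ["cowrie*"])
for method, path, body in (repo_request, snap_request):
    print(method, path, body)
```

Starting with just the most important indices keeps the first snapshot small. Later snapshots into the same repository are incremental.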

So I know this is spoon feeding and I am sorry for that. I apologise, and I appreciate this guidance.

I've ordered 2TB of storage, which will ensure we are covered for the next few months. I will be going to get the storage, but it will take time since I have a Dell workstation and there's an OEM power cable for which I need a converter (amazing), which will take 2 weeks to come :expressionless:

What do you reckon is the best way I can monitor the health of the cluster? It is very difficult to keep going away from studies when something goes wrong. Is there a way I can see at a glance that ingestion and the cluster are good to go?

For example, in Metricbeat while using stack monitoring I get this error:

[index_closed_exception] closed, with { index_uuid="OeFKiokSRviOkkjE9i12JA" & index="metricbeat-7.9.3-2021.02.21-000005" }: Check the Elasticsearch Monitoring cluster network connection or the load level of the nodes.

Also, for combining indexes & snapshots, what are the commands and what are the prerequisites?

How can I make things robust and not bother the community like a student on Red Bull? :smiley:

Hi @parthmaniar

  1. I already gave you the docs for combining indexes above.

Monitoring: Metricbeat is probably failing because of the other issues. I would suggest reading the monitoring documentation.

And apologies, but I cannot "spoon feed" you all this. We have great documentation and there are many other people to help.

Good Luck.

Thank you very much @stephenb . This has been immensely helpful. I will keep the thread open and add my experiences (the steps taken and how they went) while implementing your recommendations.

It's Monday, and we have made it to a whole new week! Have a good one :slight_smile: