Elasticsearch

To be honest, it is really confusing what you are trying to do and what your issue is.

But to make some things clear:

If you delete the data of your nodes, this will be seen as a new cluster and you need to have the cluster.initial_master_nodes setting. Once the cluster has formed you no longer need it and should remove the setting.

You need a quorum to form a cluster. You cannot have a quorum with 2 master nodes; you need 3 master nodes running on the same version to be able to form a cluster.
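For illustration, a minimal sketch of what that initial bootstrap configuration could look like on each node (cluster name, node names and hosts here are placeholders, not taken from this thread):

cluster.name: my-cluster
discovery.seed_hosts: ["es-node-1:9300", "es-node-2:9300", "es-node-3:9300"]
# only needed the very first time the cluster bootstraps; remove it once the cluster has formed
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]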

If you are trying anything different from this I would not expect it to work.

What is a blocker? It is not clear.


Your company policy does not a bug make.

I am very dubious about this company policy angle; perhaps you have misunderstood something about a generic guideline on building redundant solutions?

@leandrojmp you wrote:

and referenced the documentation. Perhaps you might explain a little deeper, because I can see how some of the wording there could be potentially confusing. @Varinder hasn’t given the sequence of events: what was shut down when, what was upgraded when, what the state of the cluster was when he shut down/upgraded, … This may impact things, but I certainly have 3-node clusters that have re-formed “fine”, meaning a node short, with only 2 nodes running; see below:

% escurl /_cat/nodes
192.168.122.177 48 92 1 0.00 0.00 0.00 cdfhilmrstw - rhel10x3
192.168.122.152 63 61 7 0.61 0.30 0.11 cdfhilmrstw - rhel10x1
192.168.122.242 49 89 2 0.13 0.09 0.08 cdfhilmrstw * rhel10x2

# forcible shutdown of all 3 VMs, as close to the same time as I could >

% virsh shutdown rhel10x1 --mode acpi & ; virsh shutdown rhel10x2 --mode acpi & ; virsh shutdown rhel10x3 --mode acpi &

# restarted 2 of these VMs >

% virsh list
 Id   Name       State
--------------------------
 6    rhel10x2   running
 7    rhel10x3   running

% escurl /_cat/nodes
192.168.122.177 57 82 73 1.17 0.35 0.12 cdfhilmrstw - rhel10x3
192.168.122.242 61 83 80 1.62 0.49 0.17 cdfhilmrstw * rhel10x2
% 

EDIT: I misattributed a quote to @Christian_Dahlqvist which was actually @leandrojmp - corrected and apologies.


If you have a 3-node cluster where all nodes are master-eligible and it is running fine, 2 of the nodes should be able to form a quorum. You can take one node out and have it operate fine, irrespective of whether you take out the current master or not.

I therefore believe it is reasonable to expect 2 nodes to be able to form a quorum when 2 of the nodes have been upgraded to the new version and started. If this is not the case I do not see how rolling upgrades without downtime are possible.

It is however hard to follow exactly how the upgrade has been performed and what changes have been made in this case, so I would not rule out some error in the process causing this. If I were moving away from Searchguard and upgrading I would do so in two separate steps to make troubleshooting and analysis easier: first remove Searchguard and verify the cluster is operating OK, then perform the upgrade.
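For example, a quick way to verify the cluster state between those two steps (a sketch only; the URL, credentials and TLS flags will depend on your setup):

# check overall health and that all 3 nodes are present before starting the upgrade
curl -s -k -u elastic "https://localhost:9200/_cluster/health?pretty"
curl -s -k -u elastic "https://localhost:9200/_cat/nodes?v&h=name,version,master"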


A blocker is the highest-priority task in my bug list.
The reason I am emphasizing forming a two-node cluster is that I have been working on clusters for more than four years. Every time I upgrade, the two-node cluster always forms.

The process is simple.
I have a three-node cluster running (ES 8.15.2 + Searchguard). I upgrade two nodes, which usually form a cluster if I start the upgrade of the two nodes at the same time, within a 30s difference.

But if I start the upgrade of one node and then start the upgrade of the second node after 5 minutes, this problem occurs.

The YAMLs of the three nodes are already attached.

I understand that it is very hard to troubleshoot with just the YAMLs and a description of the issue.

Your priority is of little concern to others on the forum, and you have not yet demonstrated a bug, not even close.

I’ve been working on various clusters (in generic sense) for 35+ years. Does that trump your 4? No, because both are completely irrelevant to the topic :slight_smile:

Well, pretty much unreadably, as you have not managed to format posts correctly yet. And 2 of the 3 you did attach included the cluster.initial_master_nodes setting, which was an error.

Usually is supposed to be good?

I’ve no idea what role searchguard might play here, but please remove it from the setup and try again. And for doing any upgrades, please follow the upgrade documentation, which is clearly a “one node at a time” style in your situation.

If this is a test system, then I suggest:

  • start over, wipe what you have
  • deploy a clean 8.15.2 on your 3 nodes, say node1/2/3, with node3 in your case as master-only
  • give better node names for clarity
  • after the cluster has formed completely, make sure to remove cluster.initial_master_nodes
  • Elastic recommends upgrading the master-only node first, so start with node3
  • wait for 3-node cluster to form
  • then say node2
  • wait for 3-node cluster to form
  • then node1
  • wait for 3-node cluster to form

Your cluster will be an operable 2-node cluster at several points during the process.

AFAIK that would be the correct process in your scenario.
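To make the per-node step concrete, a rough sketch of what it could look like on an RPM-based install (package and service names assumed, and the curl details depend on your security setup):

# on each node, one at a time:
sudo systemctl stop elasticsearch
sudo dnf upgrade elasticsearch        # or yum/rpm, depending on how you installed it
sudo systemctl start elasticsearch
# wait until the node has rejoined and all expected nodes are listed before moving on
curl -s -k -u elastic "https://localhost:9200/_cat/nodes?v&h=name,version,master"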

If that all works, then try exactly the same with searchguard. That’s my suggestion. Good luck.


You need to remove searchguard from the scenario and replicate the issue; we cannot provide any insight if searchguard is involved, because it can impact the cluster formation.

If you can replicate the issue using Elasticsearch’s native security settings, the people on the forum may be able to provide some insight, but with searchguard it is not possible to know whether this issue is caused by searchguard or not.


btw, I have a test 3-node cluster at the midpoint of my process: 3 nodes, 2 of which are data nodes and one master-only.

% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.15.2  -
rhel10x2 192.168.122.242 cdfhilmrstw 8.15.2  *
rhel10x3 192.168.122.177 m           8.15.2  -

All 3 have the exact same config file, except the 3rd node also has its roles set to master:

cluster.name: mycluster
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
discovery.seed_hosts: ["rhel10x1:9300", "rhel10x2:9300","rhel10x3:9300"]
http.host: 0.0.0.0
transport.host: _site_
xpack.monitoring.elasticsearch.collection.enabled: true
xpack.monitoring.collection.enabled: true
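Presumably the only extra line on the third node is something along these lines (an assumption about the exact form, since it was not shown above):

# on rhel10x3 only: restrict the node to the master role
node.roles: [ master ]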

@Varinder Which steps do you suggest to try to reproduce your issue?

Hi @RainTown
First of all, I really appreciate it and am thankful that you tried to reproduce it.

Second, this issue reproduces on RHEL 8.9.
I was using
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9

Install an 8.15.2 cluster, then upgrade the first node; when it starts looking for other nodes to join, start the upgrade of the second node.

I have tried on OS versions other than 8.9 and it is not reproduced.

That’s … not the correct upgrade process.

7. Start the upgraded node.
Start the newly-upgraded node and confirm that it joins the cluster by checking the log file or by submitting a _cat/nodes request:

from here

FWIW, I very much doubt RHEL 8/9/10 is going to be significant. I think here the likeliest error is in your understanding/processes.

EDIT: Sorry, without being rude, I just don’t buy that claim. Meaning it’s just as likely you are doing something different, or there’s something different with the 8.9 setup. Elasticsearch is just a Java application, it runs within its JVM, and it cares relatively little about the underlying OS version. And, frankly, it’s a bit of a surprise that you throw this tidbit into the mix on your tenth (at least) contribution to the thread?


Hi @RainTown

I totally understand the frustration. Even I was frustrated hearing this information from the person who passed it to me. But this is the information I have. I am also struggling to reproduce it.

You are all my elder or younger brothers. Please forgive me if I am making things a little complex unknowingly.

My starting point:

% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.15.2  *
rhel10x2 192.168.122.242 cdfhilmrstw 8.15.2  -
rhel10x3 192.168.122.177 m           8.15.2  -

I upgraded first node, waited.

% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.19.3  -
rhel10x2 192.168.122.242 cdfhilmrstw 8.15.2  -
rhel10x3 192.168.122.177 m           8.15.2  *

Upgraded second node, waited.

% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.19.3  -
rhel10x2 192.168.122.242 cdfhilmrstw 8.19.3  -
rhel10x3 192.168.122.177 m           8.15.2  *

Now, to demonstrate, I just rebooted rhel10x3. As it’s still on the old release, it will now not be able to join the cluster. But the 2-node cluster is still functional.

% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.19.3  *
rhel10x2 192.168.122.242 cdfhilmrstw 8.19.3  -

rhel10x3 logs

Caused by: java.lang.IllegalStateException: node version [8.15.2] may not join a cluster comprising only nodes of version [8.19.3] or greater

So I upgraded that too and waited:

% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.19.3  *
rhel10x2 192.168.122.242 cdfhilmrstw 8.19.3  -
rhel10x3 192.168.122.177 m           8.19.3  -

Upgrade complete!!

Now actually, Elastic recommends upgrading master-only nodes first, so if it were my system I’d follow that recommendation.

There was no cluster downtime. There was no point at which there were not 2+ nodes in a functioning cluster.

There is no way I am investing any more time to spin up a RHEL 8.9 3-node cluster, as I am sure I’ll just see exactly the same.


Trust, but verify!

I believe the information you got is … not of the highest quality.

There’s nothing really more I can add on this thread.

I suggest just following the correct upgrade process, with valid config files, and without 3rd party plugins. 3rd party plugins are supported by the 3rd party.

I agree with your point.
But your two-node cluster forms.
% escurl "_cat/nodes?v&h=name,ip,role,version,master" | sort
name     ip              role        version master
rhel10x1 192.168.122.152 cdfhilmrstw 8.19.3  *
rhel10x2 192.168.122.242 cdfhilmrstw 8.19.3  -

Second, I am a developer; my work is to develop a script that can install and upgrade ES on different OSes.
Different OSes with 1 script.
I know a manual upgrade is easy.
So only with this OS was there a problem.
Also, I am in R&D, so I have some time for testing.

Lastly, the support I get from this group is really helpful. I know the problem was not clear, but discussing it with the experienced and ES-knowledgeable people in this group really helps me a lot.

Once again, thanks for your help and support.

Forgive me, I know it sometimes takes me some time to clarify things.

I really respect you all for your patience and understanding.
Thanks once again, and love to all the ES family who helped me in this topic!!


Well, yes, though I (almost) followed the correct process correctly.

You might not want to re-invent the wheel, at least not from scratch. Consider opening a new thread to ask if someone has already made the effort you are now making.

Well, that point/report is answered already. Trust, but verify. I think you’ll find the report is erroneous; RHEL 8.9 will work fine too IF you follow the correct process, and IF that cluster is itself correctly set up.

Well, in answer to maybe the first or second question asked, you might have volunteered “Sorry, I am trying to do upgrades with scripts, not commands”, and maybe shared some of those scripts. It would have been more helpful than the irrelevant “company policy” stuff.

I wish you good luck with your project.


Hi Folks,
Thanks for being with me and helping me out. The issue is resolved. The root cause was a missing indices directory in ES 8.19.3’s path.data (ES data).
I created that folder, and the cluster forms successfully.

I have even tried migrating data from the old ES 8.15.2 cluster to the new ES 8.19.3 cluster,
by deleting the data from ES 8.19.3’s data path and then migrating from the old to the new cluster. That was also working fine.

But in this case I was working to remove the Searchguard data; meanwhile, the missing-directory solution clicked in my mind and it worked.

Thanks for turning the gears of your minds with me. Sorry for making things a little confusing.

Thanks.
Issue resolved finally.

Never directly delete or modify data under the data path, as this can cause severe issues and data loss. Always use the APIs to remove indices and data.
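For example, to remove an index the supported way (the index name here is just illustrative, and the URL/credentials depend on your setup):

# delete an index through the API instead of touching files under path.data
curl -X DELETE -k -u elastic "https://localhost:9200/my-old-index"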


I am glad you found your error.

This is your general problem.

I’m not really following; this looks like a very basic misunderstanding. See @Christian_Dahlqvist’s comment.

Some brotherly advice — take a moment to think about whether you might be trying to run before you can walk. Elastic offers various training courses, and I’d recommend taking some (there are even free ones!). I’m doing a few myself, in fact, and I’m considerably more experienced than you appear to be. Without knowing all the details, it seems that the work you’re attempting might be better suited to someone with real-world operations experience. People with that kind of background will probably spot a few red flags in this thread!


Hi @RainTown

Thanks for the advice. I really appreciate it. I have started watching ES YouTube videos; I found a link. I am a junior dev, my work is checked by two senior engineers, and I am not working directly in production.
But improvement is the key.

Meanwhile, is creating missing directories in the data path for a minor version upgrade a good practice?
What better steps/precautions should I take?

You should not mess with any files under the data path. Why would directories be missing?

When I upgraded ES 8.15.2 to 8.19.3, the indices dir was missing. Not sure why?
I manually created this dir and the cluster forms successfully.