Hi again forum.
I'm trying to grow my baby cluster into something more capable, but every attempt at modifying the current deployment ends in disaster.
As a starting point, I currently have one master/non-data node plus two data-only nodes... it works, somewhat sluggishly, but it works.
Logstash is injecting logs into the cluster through the lone master node.
From there, I'm trying to grow:
Attempt one
I joined 4 extra VMs to the cluster... enabled them as master/non-data nodes.
Result: the old master is no longer the elected master and the logs can't be accessed (in the head plugin I still see data on the data nodes, but the indices named logstash-master-xxxx-yyyy have disappeared).
Logstash (whose output points to the original master) cannot insert data into the ES cluster... the whole thing got screwed up again.
Attempt two
I demoted the new nodes to non-master / non-data... so at least I could balance them as query nodes with nginx... this works... but I notice the cluster becoming slow.
Questions:
Does the Logstash output only work against the elected MASTER node, or can any node ingest the data? ... The problem is that from reading the docs it is unclear to me what the tasks of the master in the cluster are, since in my experience it does more than just act as a cluster director... it seems to be the sole data entry point.
While any node could handle the ingestion load, it is recommended to use query (client) nodes for ingestion/query. Masters are meant for cluster/admin operations like allocation, state maintenance, index/alias creation, etc. Why do you want more masters? 3 should be enough.
Master nodes are responsible for managing the cluster state, and having dedicated master nodes is a good way to ensure that they can concentrate on doing this and not be affected by other activities on the node. It is recommended to have exactly 3 master-eligible nodes so that a majority can be formed in case of a network partition. Only one of these will be active at any time while the others are ready to step in if needed.
Having said that, you should not send traffic through dedicated master nodes, but instead directly to one of the data or client nodes. Any of these will be able to handle any request. In order to increase the capacity and/or performance of the cluster, you are therefore better off adding data nodes than a large number of dedicated master nodes.
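As a minimal sketch of the Logstash side (assuming a Logstash version whose elasticsearch output takes a hosts list; older 1.x releases use host/protocol instead, and the hostnames here are placeholders):

    output {
      elasticsearch {
        # send indexing traffic to data (or client) nodes, never to the dedicated masters
        hosts => ["data-node-1:9200", "data-node-2:9200"]
      }
    }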
Aha... good to know I can send the data not to the master node but directly to the data nodes...
I wonder then how they arrange the data in the correct way... since data may arrive from logstash-forwarder in a way the master finds not optimally balanced across the data nodes... I hope the data nodes can exchange data directly without bothering the master node...
On the other hand... having three eligible/candidate masters is fine... but I have 2 storage nodes plus 5 workhorse nodes... so I'm starting to get confused about how parameters like:
discovery.zen.minimum_master_nodes: 1
...should be set across the cluster.
discovery.zen.minimum_master_nodes: 3 on every node?
discovery.zen.minimum_master_nodes: 3 just on master-candidate nodes and 1 on data/search nodes?
discovery.zen.minimum_master_nodes: 1 on every node?
I'm aware this setting is crucial for everything to take shape... but I still don't have a clear picture in my mind.
Thank you very much for your help! Best regards!
EDIT:
I'm confused by the parameter regarding "N eligible masters needed to become operational".
Non-master-eligible nodes should then "see" at least 2 out of the three in the worst case... one master candidate is dead.
Master candidates should then be set to 2... in order to ensure they are not the "isolated one" in case of a network failure or whatever...
Maybe I'm overcomplicating this in my mind, but I see these parameters as critical.
Also I'm worried about the gateway/discovery setup... the more I read, the more confused I am... once the matter grows, I'm lacking a cluster howto! Hope you can help me.
Operational but insufficient setup (starting point):
1 master node (data: false, master: true) plus 2 data nodes (master: false, data: true)... 1 replica, so data gets mirrored across the two data nodes.
The master server receives the data from Logstash, hosts Kibana behind nginx, answers all the queries, and masters the cluster... it is clearly too much for one node... (but this is basically what a noob reads in every howto around).
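For reference, roughly what the current elasticsearch.yml files look like (paraphrased from memory):

    # current master (no data)
    node.master: true
    node.data: false
    discovery.zen.minimum_master_nodes: 1

    # each of the two data nodes
    node.master: false
    node.data: true
    discovery.zen.minimum_master_nodes: 1

    # one replica per index (the Elasticsearch default), which is what mirrors the data
    index.number_of_replicas: 1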
So I have summoned four additional VMs (comparable in CPU/RAM to the current master)... they can't be used as data nodes due to their limited storage.
So, in total, the resources are 5 workhorses plus the 2 data nodes.
Keeping a replica of data to prevent a data node failure is a must.
From there... there is room for the imagination to fly.
The logstash / logstash-forwarder pair has been a great disappointment and I no longer use it... instead a logstash / redis pair is what I have in mind... I want something definitely robust and stable... I'm an old-school Debian styler!
The reason for 4 is that there are 4 routed network segments... I planned to have at least one local node supporting every segment... so losing connectivity on one segment would still leave at least 3 nodes.
My original idea is to have all four nodes run redis, logstash and elasticsearch:
- redis to handle incoming logs
- logstash to parse, process and send directly to the data nodes (I have learned here this is possible! wow) (see the sketch after this list)
- elasticsearch to... THIS IS MY DOUBT... act as 4 masters (5 in total)? act as 4 masters (plus a search one)? have 3 masters + 2 search nodes? let some master-candidate nodes also act as search servers?... many options, many doubts...
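For the logstash part on each of those four nodes, what I have in mind is something along these lines (just a sketch; the redis key, filter contents and data-node hostnames are made up, it assumes the stock redis input and elasticsearch output plugins, and hosts may be host depending on the logstash version):

    input {
      redis {
        host      => "127.0.0.1"   # the local redis buffering incoming logs
        data_type => "list"
        key       => "logstash"    # hypothetical queue name
      }
    }
    filter {
      # grok / date / mutate parsing would go here
    }
    output {
      elasticsearch {
        # deliver straight to the data nodes, bypassing the masters
        hosts => ["data-node-1:9200", "data-node-2:9200"]
      }
    }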
Still, the main problem is understanding how to use the configuration options correctly... so that at least trial and error becomes feasible as a learning process... but right now all I'm doing is shooting in the dark.
Adding in those extra master-sized nodes won't really do much.
You could set it up so you have 3 masters (data: false, master: true) and then 1 client (data: false, master: false), and this would let you scale your data nodes easily.
Then you can set minimum masters to 2 and not worry about changing it; 3 masters will easily manage a large cluster.
I may rely on just one (master: false / data: false) server (the one I currently have).
Then add 3, and just three, little VMs spread around somehow, whose sole purpose is to act as masters. The correct setup would then be N=2 for the minimum of eligible masters.
Is N=2 also set on the data nodes? Is N=2 also set on the search/client node?
OK... but there is still some fog over the picture:
Regarding data delivery from logstash to elasticsearch... which is the right way to do this?... since there are plenty of options.
Could every services server (log generator) have a local logstash (resources do allow for this) to read the log files, parse/transform them and inject them directly into an ES data node?
Or maybe it is better to have a single logstash instance (helped by redis?) to deal with the whole incoming log flow, parse/transform/index it all, and deliver it to a data node?
My experience with a single logstash instance being fed by some 20 sources is very, very bad... TCP connections from routers work well, but logstash-forwarder data from 8 servers is simply unusable... very poor results... this is why I keep thinking about redis all the time...
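In the redis option, what I picture on each services server is a tiny shipper config like this (only a sketch; the log path, redis host and queue key are hypothetical):

    input {
      file {
        path => "/var/log/myapp/*.log"   # whatever the service actually writes
      }
    }
    output {
      redis {
        host      => "redis-node-1"      # the nearest workhorse node running redis
        data_type => "list"
        key       => "logstash"          # must match the key the central indexer reads
      }
    }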
You don't need 4 master nodes, 3 is enough.
The number you set for minimum masters needs to be the same on all nodes; it is based on the number of master-eligible nodes, so client or data-only nodes don't count towards it.
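A rough sketch of what that would look like in each elasticsearch.yml, using the same legacy-style settings you already have (with 3 master-eligible nodes the quorum is floor(3/2) + 1 = 2, and that same value goes on every node):

    # on each of the 3 dedicated masters
    node.master: true
    node.data: false
    discovery.zen.minimum_master_nodes: 2

    # on each data node
    node.master: false
    node.data: true
    discovery.zen.minimum_master_nodes: 2

    # on the client/search node (point logstash, kibana and nginx here)
    node.master: false
    node.data: false
    discovery.zen.minimum_master_nodes: 2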
As for the second part, that's up to your requirements and environment.