Elasticsearch deployment architecture using a laptop and a NAS (VM)

Greetings,

I would like guidance on the following architecture. It is for my studies and is not in enterprise production use; however, I wish to keep it as a long-term deployment, hence I am running the architecture past the experts in the community.

I am hoping to deploy Elasticsearch on two nodes. The primary node will always be powered on, with storage (running in RAID-0) and network (dual NIC) resiliency. It will have only Elasticsearch and Logstash installed. This node has an Intel Celeron processor with 4 GB of RAM and runs Ubuntu Server 18.04.4 LTS; Logstash is configured to use 512 MB of RAM. All log sources will communicate only with this (primary) node.

The second node is my laptop, which will be part of the cluster but will be shut down whenever I am not using it. It has an Intel Core i5 with 32 GB of RAM, runs Windows, and will have Elasticsearch and Kibana installed. I need this node to ensure I can work with the data once I am in class.

Whenever a new log source is added, the laptop will be online so that the Kibana index pattern for it can be created. I have made sure all indices have a single shard with two replicas.
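For reference, single-shard, two-replica settings like these are applied when an index is created. A minimal sketch (the host and the index name `syslog-pi` are assumptions, not part of the setup described above), which builds the request without sending it:

```python
import json
import urllib.request

# Single shard with two replicas, as described above.
SETTINGS = {"settings": {"number_of_shards": 1, "number_of_replicas": 2}}

def create_index_request(host: str, index: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates the index."""
    return urllib.request.Request(
        f"http://{host}:9200/{index}",
        data=json.dumps(SETTINGS).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Sending it requires a running cluster:
# urllib.request.urlopen(create_index_request("localhost", "syslog-pi"))
```

Note that with only two data nodes, the second replica of each shard cannot be allocated anywhere, so the cluster health will stay yellow.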

I am doing this primarily because, as a student, I cannot afford a dedicated system with high compute and memory for an ELK deployment. Secondly, as part of my research and presentations, I need the data in a mobile form (without needing a VPN all the time).

I have only a few log sources to play with: Raspberry Pis sending syslog from different services such as DNS, VPN, and an SSH honeypot.

I have attached a deployment diagram if it helps.

I would kindly ask members of the community to highlight any major resiliency faults they see in the system, and mitigations where feasible.

I will be using the latest version of all the software installed.

Thank you very much!

This isn't an ideal setup, but I can appreciate your position and it should work.

You don't need RAID0 or a bonded NIC - unless you want to set these up to see how they work, it's overkill for this.

You say that you want to be able to use the laptop's data when you are in class, which implies that it would be disconnected from the other node. That's where this idea falls down I think: you cannot split a cluster into two parts both of which continue to work independently of the other. If the laptop's node is not master-eligible then it won't serve any data without being able to contact the master.

I think you would find it simpler to run two independent one-node clusters on your two machines, with a process to sync data from the always-on one to the one running on your laptop. You could, for instance, periodically take snapshots of your always-on cluster and then restore those snapshots onto your laptop's cluster as needed.

OTOH if you do expect to have connectivity between your class environment and your always-on node then I would recommend not running a node on your laptop at all, and accessing your always-on node directly when needed.

Hi, thank you for your response. I would like to clarify (and apologies if you already understood this and I am simply repeating myself).

This is a single cluster with two nodes: the NAS is the primary and the laptop is the secondary. I am using the laptop only to take a copy of the data that I can use offline whenever I cannot connect to the primary node. None of the log sources point to the secondary node, and its details (IP or hostname) are not published to any of the hosts sending data. In this crude manner I am forcing logs to go directly, and only, to the NAS, which is always on. The laptop makes its copy when its node comes online; after synchronisation, I can work on this data in class, where I have limited connectivity.

If your answer remains the same, I will research how to carry out the workarounds you have mentioned.

Edit: Would you recommend using the laptop as a "remote" cluster?

Elasticsearch doesn't have a notion of nodes that can be used "offline". You need a separate cluster for that.

No, at least not in the sense that the manual uses the term "remote cluster". A remote cluster is only useful if it is connected to the main cluster. You want two completely independent clusters.

Thank you for this.

As part of my master's degree in cyber security, we have to do threat models and resiliency design, so both of these are part of that. :smiley: :stuck_out_tongue_closed_eyes:

Thank you very much. Let me find a suitable way to get this done; I will write up my solution at the end. I appreciate the immediate assistance.

Further to this post, what can be inferred from the primary and replica copies of an index having different sizes on disk? Is it because of a difference in filesystem? The primary is on Btrfs and the secondary is on NTFS.

Is there a way to compare primary and replica indices?

@DavidTurner @warkolm (Pardon me for tagging you both directly. I wanted to get a response so I can move on with the research. Thank you.)

That doesn't look abnormal to me, and these numbers are independent of the filesystem. There are lots of different ways of representing the same set of documents as a collection of files on disk, and the different shard copies make no attempt to choose exactly the same representations.

Thank you very much @DavidTurner. Final question: is there a way to compare two indices? By the number of documents in each, maybe?
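For reference, the `_count` API returns the document total for an index, which gives a simple way to compare the two clusters after a restore. A minimal sketch, assuming unsecured nodes on port 9200 (the hostnames and index name are placeholders):

```python
import json
import urllib.request

def doc_count(host: str, index: str) -> int:
    """Return the number of documents in an index via the _count API.

    Assumes an unsecured cluster listening on port 9200.
    """
    url = f"http://{host}:9200/{index}/_count"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["count"]

def counts_match(primary_count: int, secondary_count: int) -> bool:
    """Compare totals taken from the two clusters."""
    return primary_count == secondary_count

# Example (needs both clusters reachable; hostnames are hypothetical):
# counts_match(doc_count("nas-host", "syslog-pi"),
#              doc_count("localhost", "syslog-pi"))
```

A mismatch right after a restore usually just means more documents arrived on the always-on cluster since the snapshot was taken.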

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.