I know this is a very vague question and it may have been answered a couple of times before, but I wanted another pair of eyes to look at my plan and vet it.
I am planning to build a log management/security analytics solution that will collect logs from around 100 devices [Servers/Routers/Switches/Firewalls], which I believe should not generate more than 15-20 GB per day.
I am planning on 4 nodes:
1xLogstash - 24GB
2xES nodes [cluster] - 32 GB each, with 3 TB of disk space
1xWazuh/OSSEC node - 16 GB, accepting messages from servers and sending them directly to ES
Now my queries are -
How many shards should be configured? Is 5 enough?
What should HEAP_SIZE be on ES? With 32 GB of RAM, is 16 GB enough?
Can I install Kibana on Primary elasticsearch node? Or do I need to install Kibana on a different server?
And Kibana will/should connect to the primary node of the cluster?
Similarly, Logstash will send data to the primary node as well?
Considering future growth and shard numbers, I can add more ES nodes to the cluster, right?
Any other optimization tips would be really appreciated.
1 sounds enough (and this is the default in recent versions). 5 sounds like too many. Use ILM to move to a new index when the current one reaches a reasonable size (say 40GB).
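As a rough illustration of the rollover suggestion above, here is a minimal sketch that builds an ILM policy body with a size-based rollover and estimates how often rollover would happen. The 40 GB threshold and the 15-20 GB/day figures come from this thread; the function names are hypothetical, and the estimate assumes the on-disk index size roughly matches the raw log volume, which in practice it may not.

```python
import json

# Assumed values: 40 GB is the threshold suggested in the answer,
# 15-20 GB/day is the volume estimated in the question.
ROLLOVER_SIZE_GB = 40


def build_ilm_policy(max_size_gb: int) -> dict:
    """Return an ILM policy body whose hot phase rolls over by index size."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_size": f"{max_size_gb}gb"}
                    }
                }
            }
        }
    }


def days_between_rollovers(max_size_gb: float, daily_gb: float) -> float:
    """Estimate days until an index reaches the rollover size."""
    return max_size_gb / daily_gb


policy = build_ilm_policy(ROLLOVER_SIZE_GB)
print(json.dumps(policy, indent=2))
for daily_gb in (15, 20):
    days = days_between_rollovers(ROLLOVER_SIZE_GB, daily_gb)
    print(f"{daily_gb} GB/day: rollover roughly every {days:.1f} days")
```

The printed JSON is the kind of body that would be sent when creating an ILM policy; at these volumes a new index would be created every two to three days.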
If your machines have 32GB of RAM then 16GB (50%) is the absolute maximum you should set. You may get better performance with a smaller heap, since the remaining memory is used by the filesystem cache.
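The 50% rule can be written down as a small calculation. This is only a sketch: the function names are hypothetical, and the 26 GB cap is an assumption based on the general guidance to keep the heap below the compressed object pointer threshold, which varies by system.

```python
# Assumption: 26 GB is a conservative cap that stays below the
# compressed-oops threshold on most systems; the exact value varies.
COMPRESSED_OOPS_SAFE_GB = 26


def heap_ceiling_gb(ram_gb: int, oops_cap_gb: int = COMPRESSED_OOPS_SAFE_GB) -> int:
    """Upper bound for the heap: half of RAM, capped at oops_cap_gb."""
    return min(ram_gb // 2, oops_cap_gb)


def jvm_options(heap_gb: int) -> list[str]:
    """JVM flags setting minimum and maximum heap to the same size."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


ceiling = heap_ceiling_gb(32)
print(f"Heap ceiling for a 32 GB machine: {ceiling} GB")
print("\n".join(jvm_options(ceiling)))
```

This gives the ceiling, not the best value. A smaller heap may perform better on these nodes.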
There's no such thing as a "primary node". Both nodes are equal from Elasticsearch's point of view, and you can send data and searches to either.
Yes. Note that you cannot build a fault-tolerant cluster with only two nodes - you need at least three master-eligible nodes for resilience. But after that you can add data nodes as needed. You might also want to segregate your data nodes into a hot/warm architecture.
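The reason three master-eligible nodes are needed comes down to majority arithmetic. The sketch below shows only that arithmetic, with hypothetical function names; how Elasticsearch actually manages its voting configuration is more nuanced than this.

```python
def majority(master_eligible: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - majority(master_eligible)


for n in (1, 2, 3, 5):
    print(f"{n} master-eligible node(s): majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With two nodes the majority is two, so no failure can be tolerated. Three is the smallest count that survives the loss of one node.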
If I am not wrong, one shard will not span multiple nodes, correct? And considering future growth, if the data volume increases and I introduce 2 more nodes, 1 shard will not be enough, right?
Ok I think I see. By "number of shards" I mean the number set by the index.number_of_shards setting. 1 is the default for that and that sounds reasonable for your case. But for fault tolerance each shard must have a replica, and this is controlled by the independent index.number_of_replicas setting. 1 is the default for that too, meaning that each shard will have a copy on both nodes, and that sounds good for you too.
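To make the relationship between the two settings concrete, here is a small sketch that counts shard copies for one index. The setting names are the real ones mentioned above; the helper function is hypothetical.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Each primary shard plus its replicas counts as one copy each."""
    return number_of_shards * (1 + number_of_replicas)


# Index settings matching the defaults discussed above.
settings = {"index": {"number_of_shards": 1, "number_of_replicas": 1}}

copies = total_shard_copies(
    settings["index"]["number_of_shards"],
    settings["index"]["number_of_replicas"],
)
print(f"Shard copies per index: {copies}")
```

With the defaults, each index has two copies, so on a two-node cluster each node holds one. A single index cannot occupy more nodes than it has copies, but because ILM rollover keeps creating new indices, the shards of different indices can be spread over additional nodes as the cluster grows.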
Why do you need Logstash? I think it's rather poor software because of the Ruby interpreter...
ES has had ingest nodes since 5.x, and they support 30 or so processors (filters).
Unless you need to enrich IP addresses with GEO data (as far as I know there is no such ES filter), I don't understand why you would spend 24 GiB of RAM on Logstash when that machine could be an ES ingest node.
However, I am not sure whether an ingest node can listen on a port for incoming data from my network devices. From servers it would work, since I will be installing shippers there, but what about network devices?
Packetbeat is also a library. It supports many application layer protocols, from database to key-value stores to HTTP and low-level protocols. Choose the one you need or add your own by submitting a pull request.