My team and I are trying to improve a cluster that we inherited. I apologize in advance for some vagueness; due to the location of the cluster I can't share everything, but I will give as much detail as possible. We are currently working with a 7-node cluster. All data nodes are Dell R640s, and we are on Elasticsearch 7.6. We also have three Logstash servers feeding data into the cluster. There is a small amount of pre- and post-processing happening on the Logstash servers, but the data nodes are all still set to the defaults, so they are all still dilm nodes (data, ingest, machine learning, and master-eligible). There is a huge ingest pipeline on the cluster that is doing most of the processing and mapping of the incoming data.
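For reference, this is how I've been confirming what roles each node currently holds; it's just the standard `_cat/nodes` API, nothing specific to our cluster:

```
GET _cat/nodes?v&h=name,node.role,master
```

On 7.6 the `node.role` column shows the letter codes (d = data, i = ingest, l = machine learning, m = master-eligible), which is where the "dilm" above comes from.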
I have been doing research and have been told that it's not ideal to be using both Logstash and ingest pipelines. That may be true, but offloading all of the ingest processing out of the cluster and onto the Logstash servers seems like a pretty major undertaking to perform on a production cluster. We are currently sustaining about 20,000 indexing operations per second. I was hoping to bring two more nodes online and make them dedicated ingest-only nodes. While I am at it, I would like to make only three of the data nodes master-eligible, and for the time being I am going to disable machine learning on all nodes, as it is not currently in use.
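In case it helps to see the plan concretely, this is the shape of the change I have in mind. On 7.6 node roles are still individual boolean settings in `elasticsearch.yml` (the `node.roles` list syntax only arrives in 7.9), so a sketch would be:

```yaml
# elasticsearch.yml on the two new dedicated ingest-only nodes (7.6 syntax)
node.master: false
node.data: false
node.ingest: true
node.ml: false

# elasticsearch.yml on the three data nodes that stay master-eligible
node.master: true
node.data: true
node.ingest: false
node.ml: false
```

The remaining four data nodes would get `node.master: false` as well, and each node needs a restart to pick up the role change, so I'd plan to roll through them one at a time.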
I was wondering if you guys think that this is a decent or horrible idea. I am also open to any other suggestions that you believe would make this cluster function more efficiently.
I appreciate the response. I have spent the day talking to one of the developers for the system and am now confused about the setup of our system. I am going to hold off asking any questions until I get a little more clarification. Thanks very much for your time.
OK, so I have some more information on the cluster, and a clarifying question. I have been discussing the design of our cluster with one of the devs, and most of the unstructured data that comes into our cluster is processed by Filebeat, which then pushes the documents into the cluster. That is fine, but by default all of our nodes are still ingest nodes, and there is a roughly 12,000-line ingest pipeline in our cluster state.
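If I understand the dev correctly, the Filebeat side looks roughly like this (the names here are hypothetical; this is just the standard way Filebeat's Elasticsearch output points events at a named ingest pipeline):

```yaml
# filebeat.yml (hypothetical sketch)
output.elasticsearch:
  hosts: ["https://es-node:9200"]
  pipeline: "our-giant-pipeline"   # hypothetical name for the 12,000-line pipeline
```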
Would that mean that all of the data is getting processed multiple times? I know there are bulk indexing operations running frequently on our cluster. From what I have read, if you have an ingest pipeline on the ingest nodes in your cluster, it will intercept the bulk index request, run the data through the pipeline, and then index it. Am I way off here?
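My current understanding (please correct me if this is wrong) is that a pipeline only runs when a request actually names it, either per request or via an index-level default, so a document should pass through it once. Again with hypothetical names:

```
# Per request: these bulk docs go through the named pipeline once
POST _bulk?pipeline=our-giant-pipeline
{ "index": { "_index": "logs-2020.03" } }
{ "message": "example event" }

# Per index: anything indexed without an explicit pipeline uses this default
PUT logs-2020.03/_settings
{ "index.default_pipeline": "our-giant-pipeline" }
```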