I have an ELK stack running: 8 data nodes, 4 ingest nodes, 4 Logstash instances, and a shitload of applications that send data from multiple servers via Filebeat. These applications range from 4 servers to 24 at this location (the other half of the apps are at another location and send data to their own ELK stack). My Logstash instances run 4 pipelines each, on ports 5044 to 5047. The busy apps have their own pipeline, while the smaller ones share one.
The biggest and most important index receives around 531,746,569 documents a day, with more pressure in the late afternoon and evening and less at night and in the morning.
Now sometimes there are peak moments in which too much data comes in at once, or a network or storage event causes a hiccup. The stack then needs some time to catch up, but the load of documents being sent is too high to do this quickly. So in the end I always have a gap of non-recoverable events.
I would like to prevent this. I've learned that people use Redis to solve this, but what would be a decent setup for it?
Should I set a Redis in front of each Logstash? (I am going to use Redis at first only for the pipeline on 5044.) That would mean: Filebeat --> 1 of 4 Redis --> 1 of 4 Logstash --> Elasticsearch
Should I put Redis behind Logstash? Filebeat --> 1 of 4 Logstash --> 1 of 4 Redis --> Elasticsearch?
Maybe even both? Filebeat --> 1 of 4 Logstash --> 1 of 4 Redis --> 1 of 4 Logstash --> Elasticsearch?
Are 4 Redis instances overkill, or undersized?
And since I understand Redis is an in-memory DB, what would be an appropriate amount of RAM?
Are there other (better) options than Redis to consider?
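To make option 1 concrete, this is roughly what I have in mind (hostnames and the Redis key name are just placeholders, not my real config):

```yaml
# filebeat.yml -- ship to Redis instead of directly to Logstash on 5044
output.redis:
  hosts: ["redis1:6379", "redis2:6379", "redis3:6379", "redis4:6379"]
  key: "filebeat-5044"     # the Redis list events are pushed onto
  loadbalance: true
```

```
# Logstash pipeline -- the beats input on 5044 replaced with a redis input
input {
  redis {
    host      => "redis1"
    data_type => "list"
    key       => "filebeat-5044"
  }
}
```

The idea being that Redis absorbs the spike and Logstash drains the list at its own pace.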
@Tuckson As @warkolm said, Kafka is the typical tool of choice we see today for this.
Often it looks like this / this is the typical "base" architecture:
Source -> Kafka -> Logstash -> Elastic
Although there are variants depending on the scale, "spikiness", and reliability desired:
Source -> Kafka -> Logstash -> Kafka -> Elastic
How you divide up your topics and scale tends to be use case specific.
If this were prod, I would run a minimum of 2 Kafka brokers (although I am an n+1 believer, so I like 3), so that if I lose one I still have 2 running.
You can map topics to sources or have just 1 topic. In the wild, I see folks tend to do some form of mapping of sources to topics for a bit more granular control... Perhaps you could start by just taking that big index / source and feeding it through Kafka.
Perhaps you might end up mapping the topics to your pipelines; that could be a natural fit.
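A sketch of what topic-per-pipeline could look like (topic names, broker hostnames, and the consumer group are assumptions, adjust to your setup):

```yaml
# filebeat.yml -- the busy app publishes to its own topic
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: "busy-app"
  required_acks: 1
```

```
# Logstash -- the pipeline that used to listen on 5044 now consumes the topic
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics            => ["busy-app"]
    group_id          => "logstash-busy-app"  # all 4 Logstash instances share one consumer group
  }
}
```

With all 4 Logstash instances in one consumer group, Kafka splits the topic's partitions across them, which gives you the load balancing the 4 ports do today.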
We actually have some docs on it, and with a quick search you will find lots of articles because it's pretty popular.
Of course the downside is.... more to manage... but the payoff could be a more stable pipeline and happier customers.
BTW, there is a fairly new setting that I am just learning about, regarding indexing pressure.
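If the setting meant here is the node-level indexing pressure limit (my assumption), it lives in elasticsearch.yml, e.g.:

```yaml
# elasticsearch.yml -- cap how much heap outstanding indexing requests may hold
# (defaults to 10% of heap; when exceeded, the node rejects new indexing requests
# with a 429 instead of falling over)
indexing_pressure.memory.limit: 10%
```

This doesn't add buffering, it just makes nodes shed load earlier and more gracefully during the peaks.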
Thanks. For the first iteration I will just add it to 1 pipeline. The others will keep connecting without a buffer for now.
What I am wondering... To try to improve the pipelines I started using dedicated ingest nodes. They only serve as an entry point and do no data processing other than what Elasticsearch makes them do out of the box.
Does this make sense, or can I just as well remove them and let the data come in on the other nodes?
edit:
Another question: is Kafka able to push data? (My experience with message queues is limited; I have only seen situations where the target application fetches data from the queue rather than having messages pushed to it by the queue.)
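For what it's worth, Kafka is pull-based: consumers poll the brokers, the brokers never push. Logstash's kafka input plugin does that polling for you, so from the pipeline's point of view events just arrive like any other input. A sketch (topic and group names made up):

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics            => ["busy-app"]
    group_id          => "logstash-indexers"
    consumer_threads  => 4   # number of polling threads on this Logstash instance
  }
}
```

The pull model is exactly what helps here: when Elasticsearch slows down, Logstash simply polls less, and the backlog accumulates safely in Kafka instead of being dropped.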
I have a separate coordinating node that my Kibana and Grafana point to. And yes, the sole reason I have these ingest nodes is an attempt to handle the peaks during high load. I can have situations where my load doubles within minutes, and this has often led to delays and eventually outages and data loss.
I can't stand that, because we also have a Splunk cluster running (it's a bit of a competition) and they seem to have no problems, while not exactly having been tuned for performance.