This is the second time I'm posting this, and I hope someone can provide feedback on why data is so slow to become available in Kibana. The data is supposed to be near real time, but instead it is 5 hours delayed.
Here is my data flow:
source (Metricbeat on 50+ servers) -> Kafka -> 2 LS nodes (no filtering) -> ES
My ELK cluster consists of 4 data nodes, 2 coordinating nodes and 3 masters. We are using 5.4.1 across the whole cluster, including Beats.
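For reference, the LS pipelines are basically just a Kafka input feeding an Elasticsearch output, roughly like the sketch below (brokers, topic, group, hosts and index name are placeholders, not our exact values):

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"   # placeholder brokers
    topics            => ["metricbeat"]              # placeholder topic
    group_id          => "logstash"
    codec             => "json"                      # Metricbeat events arrive as JSON
  }
}

# No filter block - events pass through untouched.

output {
  elasticsearch {
    hosts => ["coord1:9200", "coord2:9200"]          # the two coordinating nodes
    index => "metricbeat-%{+YYYY.MM.dd}"
  }
}
```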
My issue starts when I introduce ES as my LS output:
Data flows nicely for a few minutes, then it falls further and further behind, to the point where it is 5 hours late.
Also, I start seeing lots of Kafka commit offset failures and partition re-assignments in the LS logs.
Looking at the file output on LS, I can see that data is still way behind, unlike when ES is NOT configured as an output.
The same problem existed even before we introduced Kafka.
Once I take ES out of the picture and write the data to local files on LS, it is super fast and keeps up no matter how many records I'm polling from Kafka, and everything is up to date.
I have tried many possible solutions: I increased the Kafka/LS input timeouts and reduced the number of records LS polls to 1000 instead of the default, and still ES only gives me a few seconds' worth of new data under Discover in Kibana.
My questions are:
Is the number of data nodes I have enough?
Is 6 CPUs per data node enough?
Why does data become very slow in ES/Kibana when I use the JSON codec on LS, but with plain text it seems able to keep up?
I appreciate any help here, as we are going to production soon and this is a gating issue.
Indexing is reaching up to 2k events/s.
I tried 2 scenarios: one index for all servers/events (one ES output per LS), or up to 5 indices (multiple ES output plugins per LS).
The 2 LS nodes share the load; each is processing about 400-600 events per second.
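In the multi-index scenario the output section looked roughly like the sketch below; the field and index names here are only placeholders:

```
output {
  if [fields][service] == "web" {
    elasticsearch {
      hosts => ["coord1:9200", "coord2:9200"]
      index => "metricbeat-web-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["coord1:9200", "coord2:9200"]
      index => "metricbeat-other-%{+YYYY.MM.dd}"
    }
  }
}
```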
What type of disks do you have on your data nodes? Spinning or SSD? Remote file system?
Local spinning disks, not SSD and not remote; everything is running on its own VM.
How many indices are there right now?
If you mean in total, then we have a very low number: 17, mostly from 2 days of testing (excluding system indices).
How many of these indices/shards are you actively indexing into? Are you assigning your own document IDs or are you letting Elasticsearch assign them automatically? Have you tried updating the refresh interval from the default 1s to e.g. 10s in order to reduce merging by creating larger initial indices?
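For reference, the refresh interval can be changed per index (or baked into an index template) with a settings update along these lines; the index name is only an example:

```
curl -XPUT 'http://localhost:9200/metricbeat-2017.06.20/_settings' -H 'Content-Type: application/json' -d '
{
  "index": {
    "refresh_interval": "10s"
  }
}'
```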
That's what I don't see; there is nothing in the logs!
Do I need to increase my log level?
As for the heap, we have assigned 12GB out of the 24GB of RAM.
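(For completeness, the heap on the data nodes is set in jvm.options; ours looks roughly like this:)

```
# config/jvm.options on each data node - 12GB out of 24GB RAM
-Xms12g
-Xmx12g
```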
Heap usage sometimes reaches 60-70%; I can see that under Monitoring in Kibana.
Does that play a part in data delay ?
Also, we are sending data to the ES cluster through 2 coordinating nodes; is that OK?
In case anyone else comes across this issue, here is how I was able to resolve it.
After doing lots of testing and research, I was able to tweak 2 things to overcome this problem:
Increase the default LS batch size (125) to a number you think your LS nodes can keep up with, something like 2000-3000 (see the sketch after this list).
Increase the number of pipeline workers. Again, base the number on your system specs.
You may also need to tweak the Kafka input plugin timeout values to make sure data has enough time to make it all the way through to ES before the Kafka session times out.
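For concreteness, the two pipeline settings live in logstash.yml; the numbers below are only illustrative and need to be sized against your own CPU and heap:

```
# logstash.yml - illustrative values, not a recommendation
pipeline.workers: 8        # default matches the number of CPU cores
pipeline.batch.size: 2000  # default is 125
```

The Kafka timeouts are options on the kafka input in the pipeline config (5.x plugin option names; example values only, and session_timeout_ms has to stay within the broker's group.max.session.timeout.ms):

```
kafka {
  bootstrap_servers  => "kafka1:9092,kafka2:9092"   # placeholder brokers
  topics             => ["metricbeat"]              # placeholder topic
  session_timeout_ms => "60000"
  request_timeout_ms => "70000"                     # should be larger than session_timeout_ms
}
```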
Changing the above has resulted in:
no Kafka/LS errors in the logs
ES keeping up with the data quickly and at a uniform pace
a steady stream of in/out data on the LS nodes.
Please note that you need to keep an eye on your JVM heap and CPU utilization, as you may need to adjust them to achieve the desired results with the above approach.
Thank you for sharing your experience. It will hopefully provide answers for others in the future. I'm glad you got things to work. I feel that a few points should be made, though.
While I'm glad that this change worked for you, it is not recommended in every situation. As a matter of fact, it can be a very bad idea in many situations. Please test your situation individually before just accepting someone else's use case. See here and here for our authoritative (but not exhaustive) recommendations for approaching those settings.
The default is usually to match the number of cores reported by your system. If on a VM, this number may need to be manually tweaked. The default is usually acceptable if on bare metal.
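If you want to confirm what Logstash is actually running with, the node info API reports the effective pipeline settings (port 9600 assumes the default HTTP API binding):

```
# Returns the effective workers, batch_size and batch_delay of the running pipeline
curl -s 'http://localhost:9600/_node/pipeline?pretty'
```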
I appreciate your response and the info you have provided/referenced; that should be very helpful for us in judging performance based on the changes.
Once we made the changes, we took into consideration our LS and data node CPU utilization and JVM heap as the 2 main points for judging whether the change is good or bad.
Looking at CPU, I still see low utilization, 20% or less per LS node, but heap is getting close to 50%, which we can increase since we have allocated very little for now, for testing purposes.
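For anyone who wants to watch the same numbers outside of Kibana Monitoring, the Logstash node stats API exposes the heap and process/CPU figures (again assuming the default API port 9600):

```
# JVM heap usage of the Logstash node
curl -s 'http://localhost:9600/_node/stats/jvm?pretty'

# Process-level stats, including CPU
curl -s 'http://localhost:9600/_node/stats/process?pretty'
```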
Also, I would like to mention that we have been debugging for the past few days/weeks, and with the defaults and 2 LS nodes there was no way we could keep up with the data coming from Kafka when ES was the output destination. We were always hours behind, with a very big lag; but if we only use the local file system on LS as the output, then we could probably survive with the default LS settings.
Please don't mistake my meaning. You have done the right thing for your use case, particularly for your Kafka broker. It's just that it is not the best solution for all "Logstash is backed up / not catching up" scenarios. I just want future readers of this thread to understand that.
Of course, you are right! Not all scenarios are the same.
I was just making sure to point out that we have looked at certain things to make sure we are not abusing the settings nor the resources we have.
To summarize this problem for others: based on the debugging we did, and with Kafka being part of our solution, we had to go with this approach for many reasons. Other scenarios may require looking at and debugging different parts of the cluster and using different solutions/settings, even if the symptoms look similar.
We are only experimenting right now, even if some think it's production. We have some old, fairly large physical servers for testing, with lots of CPU and RAM on my 2 Logstash nodes, so I put an ES ingest node on each LS node. That seems to really offload my data nodes for ingest processing.
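For anyone curious, a co-located ingest-only node is just an ES instance with the data and master roles switched off; roughly like this in elasticsearch.yml (cluster and node names are placeholders):

```
# elasticsearch.yml for an ingest-only node running next to Logstash
cluster.name: my-cluster      # placeholder
node.name: ls-ingest-1        # placeholder
node.master: false
node.data: false
node.ingest: true
```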