Hello folks! Please help with these questions.
I am running a load test which is generating 8-10 GB of real-time logs per hour. What settings are best to make sure that no data is lost?
What is the difference between stat_interval and pipeline.batch.delay?
What value would be optimal for pipeline.batch.size for the data volume I have?
I have read about persistent queues. Can I use them in my case, and what settings would be optimal?
The optimal batch size depends a lot on the data and the number of indices and shards you are indexing into. As you have not given any details about either, it is hard to tell. The same applies to persistent queue settings.
In general the default settings are quite good, so I would start with those and only tweak if you encounter performance problems. Even then, the filters and pipeline configuration usually make a much larger difference than the parameters you are now looking to optimize.
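For reference, here is roughly how those settings look in logstash.yml. The values below are approximately the shipped defaults (they vary by version), so treat this as a sketch rather than a tuned recommendation:

    # logstash.yml -- pipeline settings mentioned above (approximate defaults)
    pipeline.workers: 4        # defaults to the number of CPU cores on the host
    pipeline.batch.size: 125   # events each worker collects before running filters/outputs
    pipeline.batch.delay: 50   # ms to wait for a batch to fill before flushing it anyway

Larger batches can raise throughput at the cost of memory and latency, which is why the best value depends on the data and on how many indices and shards you are writing to.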
I am creating a new index every day, and my logs are created every hour, so basically all 24 hours of data go into one index. I am sorry, I don't have much context about shard settings.
The need for persistent queues depends on your data source. How is Logstash retrieving the log data? If it is receiving UDP syslog data, there is no recovery without persistent queues. If it is harvesting a file or getting Filebeat/Winlogbeat data, the sender will wait for availability, so persistent queues aren't as important.
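If persistent queues do turn out to be necessary (for example with UDP syslog), a minimal logstash.yml sketch would look something like this; the path and sizes are placeholders, not recommendations:

    # logstash.yml -- enable the disk-backed (persistent) queue
    queue.type: persisted                  # default is "memory" (in-memory only)
    path.queue: /var/lib/logstash/queue    # placeholder path on a disk with enough free space
    queue.max_bytes: 4gb                   # placeholder cap; when the queue fills, inputs get back-pressure
    queue.checkpoint.writes: 1024          # events between checkpoints; 1 is safest but slowest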
You say "my logs are created every hour". Does that mean they won't be seen by Logstash until the hour passes? If so, you already have a huge lag (30 minutes on average) in the data, so quick logging isn't important; reliable logging is.
The standard design for HA is to avoid single points of failure. The Beats can provide load balancing for availability and load, and Logstash can too. As I said above, things like syslog increase complexity.
Thanks for helping. In my case Logstash is harvesting data from a file, and each hour a new file is generated, but for now I will rely on the default settings.
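For completeness, a file-harvesting setup like that is configured in the pipeline config rather than logstash.yml; a minimal sketch (the paths and glob are placeholders) might look like the block below. This is also where stat_interval comes from: it is a file input option that controls how often already-discovered files are checked for new data, while pipeline.batch.delay is a pipeline-level setting that controls how long a worker waits for a batch to fill before flushing it.

    # pipeline .conf sketch -- file input for hourly log files (paths are placeholders)
    input {
      file {
        path => "/var/log/myapp/*.log"                     # glob matching the hourly files
        start_position => "beginning"                      # read newly discovered files from the start
        sincedb_path => "/var/lib/logstash/sincedb-myapp"  # placeholder; remembers read positions across restarts
        stat_interval => 1                                 # seconds between checks of known files for new lines
        discover_interval => 15                            # stat_intervals between re-expanding the glob for new files
      }
    }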
I have one more question which is bothering me a lot. How do I verify that all the data has been parsed by Logstash? The index size and the file size cannot be compared, and the data cannot be compared manually because the files are very large. I have two file types, each with a different log format, as shown below.
- Apache error logs
- a log file with key/value pairs, each pair separated by dash lines
Verifying ingest may not be easy. It probably involves two things: 1) checking the Logstash (and everything else) error logs to catch any condition that is detected, and 2) some type of auditing. As far as I know, you would have to create your own auditing. As in financial auditing, you would have to sample the source data and verify that it has been ingested. You could use Elasticsearch queries to validate the data; I'd suggest sampling:
- the first and last event in every file
- random blocks of events from random files
- if there are different event types, make sure that you sample events from all types; a rare event type can be missed in random samples
- if you preserve the source timestamp, you might be able to audit events per hour, matching the counts in the source vs. Elasticsearch
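As an illustration of that last point, the per-hour counts in Elasticsearch can be pulled with a date_histogram aggregation and compared against line counts taken from the source files. The index name and timestamp field below are assumptions based on the setup described above; on older Elasticsearch versions the parameter is "interval" rather than "calendar_interval":

    # Kibana Dev Tools sketch -- events per hour in one daily index (index name is a placeholder)
    GET /myapp-logs-2024.01.15/_search
    {
      "size": 0,
      "aggs": {
        "events_per_hour": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "hour"
          }
        }
      }
    }

If the hourly totals match the per-hour counts from the source files, that is a fairly strong signal that nothing was dropped; a mismatch points you to the exact hour and file to inspect.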