I am using the Logstash aggregate filter to aggregate two log lines with the same "uuid".
However, the documentation states: "You should be very careful to set Logstash filter workers to 1 (-w 1 flag) for this filter to work correctly otherwise events may be processed out of sequence and unexpected results will occur."
Since my system has considerable traffic, I am using the default number of workers (the number of the host's CPU cores).
Because of this, I found that most of the logs were not aggregated properly.
Is there an alternative way to perform the aggregation while keeping multiple workers?
The aggregate filter does indeed have this limitation, which limits performance considerably and prevents scaling across multiple threads and Logstash instances. For a solution that scales, it is probably better not to rely on the ingest layer to handle this.
One option could be to have a batch process that periodically queries new data and updates documents where needed. This would typically run externally to Elasticsearch and be implemented using one of the language clients.
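As a rough illustration of the merge step such a batch job might perform, here is a minimal Python sketch. It assumes each log line has already been fetched as a dict carrying a "uuid" field; the function name `merge_by_uuid` and the other field names are made up for the example, and the querying and write-back through a language client are left out.

```python
from typing import Dict, Iterable


def merge_by_uuid(events: Iterable[dict]) -> Dict[str, dict]:
    """Combine all events that share a uuid into one document per uuid.

    Events are applied in the order given, so when two events carry the
    same field, the value from the later event wins.
    """
    merged: Dict[str, dict] = {}
    for event in events:
        # "uuid" is the correlation field from the question; every
        # other field name used with this function is hypothetical.
        merged.setdefault(event["uuid"], {}).update(event)
    return merged


if __name__ == "__main__":
    batch = [
        {"uuid": "a", "request": "GET /"},
        {"uuid": "b", "request": "POST /x"},
        {"uuid": "a", "status": 200},
    ]
    for doc in merge_by_uuid(batch).values():
        print(doc)
```

Because the merge happens after indexing, the order in which Logstash workers processed the lines no longer matters.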
You could also create an entity-centric index where you store a single document per UUID (and use the UUID as the document ID). When you find a document that should be aggregated, you update this entity document (the first time, it would be indexed) while at the same time writing the original document to the standard index.
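A minimal sketch of what that double write could look like, assuming the action-dict format accepted by the bulk helper in the official Python client. The index names "logs-raw" and "logs-by-uuid" and the function name `build_actions` are placeholders, not anything from this thread. The sketch only builds the actions, so it runs without a cluster.

```python
from typing import Iterator


def build_actions(
    event: dict,
    raw_index: str = "logs-raw",         # hypothetical standard index
    entity_index: str = "logs-by-uuid",  # hypothetical entity-centric index
) -> Iterator[dict]:
    """Yield two bulk actions for one log event."""
    # 1) Write the event unchanged to the standard index.
    yield {"_op_type": "index", "_index": raw_index, "_source": event}
    # 2) Upsert the entity document keyed by uuid: it is created on the
    #    first event for that uuid and partially updated afterwards.
    yield {
        "_op_type": "update",
        "_index": entity_index,
        "_id": event["uuid"],
        "doc": event,
        "doc_as_upsert": True,
    }


if __name__ == "__main__":
    for action in build_actions({"uuid": "a", "status": 200}):
        print(action)
```

Since the update is a partial-document upsert, the two log lines for a UUID can arrive in either order and from any worker, which is what removes the single-worker constraint.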
I could not find a better solution with multiple workers, so I moved to a multiple-pipeline solution, with each pipeline running a single worker. This system is working very well under high transaction load.
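The post does not say how events are distributed across the pipelines. One assumption worth stating: both log lines for a given uuid have to reach the same single-worker pipeline, otherwise the aggregate filter never sees them together. A hedged Python sketch of such a routing rule, with `pipeline_for` as a hypothetical helper on whatever sits in front of Logstash:

```python
import hashlib


def pipeline_for(uuid: str, pipeline_count: int) -> int:
    """Map a uuid to a pipeline number in [0, pipeline_count).

    A cryptographic digest is used instead of the built-in hash(),
    because hash() on strings varies between Python processes.
    """
    digest = hashlib.sha256(uuid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pipeline_count


if __name__ == "__main__":
    for uuid in ("req-1", "req-2", "req-1"):
        print(uuid, "-> pipeline", pipeline_for(uuid, 4))
```

Any stable function of the uuid would do. The point is only that the choice depends on the uuid alone, so related lines stay in order within one worker.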