We are facing some issues with Elasticsearch: we are seeing many duplicate log entries like the ones below. In logstash.yml there is only one file referenced by path.config: path.config: "/etc/logstash/pipeline.global.conf"
Thank you for your reply. The logs originate on shared drives, but they are not duplicated there. We use Filebeat and Logstash with the following pipeline configuration:
if [client] == "iis" {
  if [indexname] {
    elasticsearch {
      hosts => [ "https://xxxxxx:9200" ]
      index => "rq-%{[client]}-%{[indexname]}-%{+YYYY.MM.dd}"
    }
  }
}
Logstash and Filebeat can have issues reading from network drives, which is why doing so is not recommended. This may very well be why you are seeing duplicates.
There could also be other reasons, e.g. issues with your Logstash config, but it is hard to tell without more details around config and how frequent the duplication issue is.
In that case you are right. But in another example, we checked the log file and there were no duplicated lines. In this case, only the _id and the ingest time are different:
If you are sending documents to Elasticsearch in bulks and do not specify a custom document_id, exactly-once delivery cannot be guaranteed.
When the ES cluster is busy it might "reject" indexing requests (so-called back pressure). For a bulk request, the coordinating node splits it into smaller sub-requests (one per shard) and sends them all in parallel. In that scenario some sub-requests can complete successfully while others fail, resulting in a partial rejection, yet Logstash may retry the whole bulk.
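For illustration, a partially rejected _bulk response looks roughly like the abridged sketch below (the index name and _id are placeholders following your index pattern). Note "errors": true with one item accepted (201) and one rejected (429). If the client then retries the entire bulk, the already-accepted document is indexed a second time under a new auto-generated _id:

```json
{
  "took": 30,
  "errors": true,
  "items": [
    { "index": { "_index": "rq-iis-example-2024.01.01",
                 "_id": "auto-generated-id",
                 "status": 201 } },
    { "index": { "_index": "rq-iis-example-2024.01.01",
                 "status": 429,
                 "error": { "type": "es_rejected_execution_exception" } } }
  ]
}
```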
If you provide your own document_id, it should fix the duplicate issue, but it will affect your indexing performance.
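A minimal sketch of doing this with the Logstash fingerprint filter (the host placeholder and index pattern are reused from the config above; which source fields uniquely identify an event is an assumption you should adjust for your data):

```
filter {
  # Hash the raw log line so identical events always get the same id;
  # a retried bulk item then overwrites instead of creating a duplicate.
  fingerprint {
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts => [ "https://xxxxxx:9200" ]
    index => "rq-%{[client]}-%{[indexname]}-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

Using [@metadata] keeps the hash out of the stored document; hashing only "message" means two genuinely identical lines from different files would also collide, so include fields like the file path in source if that matters for your logs.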
That is a significant difference in ingest time, which suggests it is not due to retries when indexing into Elasticsearch. I would recommend searching for other log entries from that file around that time and checking whether those are also duplicated. If the entire file is duplicated, it would seem like something happened at the file system level, causing the file to be reprocessed, but it is hard to tell without being able to investigate the data directly.