Messages from filebeat are correctly consumed into elasticsearch, but when I restart filebeat it sends the same message to elasticsearch again, so there is an extra record. Why does add_id: ~ not take effect here?
Can you share the duplicated messages in Kibana as @stephenb asked?
I'm not a Kafka expert, but looking at your config it doesn't seem that filebeat would re-consume already consumed messages, since the group_id doesn't change.
Could the producer be sending the same message twice to Kafka?
Also, what do you want to achieve with the add_id processor? It just adds a random id to each event, so if filebeat receives the same message more than once, each copy will get a different id.
This is in my test environment. I use the producer to send only one message to Kafka, so there is always exactly one message in Kafka. While filebeat is running, the message is delivered to elasticsearch once. When I restart filebeat, it delivers the same message to elasticsearch again, even though no new message was written to Kafka during the restart. I want to use add_id to generate a unique id for each message, so that the same data is not consumed into elasticsearch repeatedly when filebeat restarts.
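A minimal sketch of the kind of configuration described here (the broker address, topic name, and group_id are placeholders, not the actual values from the setup above):

```yaml
filebeat.inputs:
  - type: kafka
    hosts: ["localhost:9092"]   # placeholder broker address
    topics: ["test-topic"]      # placeholder topic
    group_id: "filebeat"        # consumer group; offsets are tracked per group

processors:
  - add_id: ~                   # adds a randomly generated id to each event

output.elasticsearch:
  hosts: ["localhost:9200"]     # placeholder Elasticsearch address
```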
If possible, please edit your filebeat config and remove the field "kafka.offset" from the list of fields that you are dropping, to check whether it is indeed the same message from Kafka.
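For example, if the drop_fields processor looks something like the sketch below (the field names here are assumed, since the actual config was not shared), keeping kafka.offset in the event lets you compare the offsets of the two documents in Elasticsearch:

```yaml
processors:
  - drop_fields:
      # assumed list of dropped fields; "kafka.offset" is intentionally left out
      # so the two documents can be compared by offset
      fields: ["kafka.partition", "kafka.key"]
      ignore_missing: true
```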
The documentation describes the add_id processor as generating a unique id for an event. Is my understanding wrong? https://www.elastic.co/guide/en/beats/filebeat/7.16/add-id.html
If I'm not wrong, it is unique in the sense that each event will have a unique id, but if filebeat processes the same message again for some reason, the id will not be the same.
For example, if you have a log file with the following lines:
first message
first message
second message
Each one of those lines is one event, and each one will have a unique id. The first two lines are the same, but the id generated by the add_id processor will be different.
Hi, this is what I just simulated. Except for the filebeat agent.hostname field, the other fields are exactly the same. Is it because the agent.hostname field is not the same? Does that mean it is not a repeated event?
It just adds a unique id; it will not help with duplicates. What you want is the fingerprint processor.
This will generate an id based on some field of your document, like the message field, so if the message is the same, the generated id will be the same.
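A minimal sketch of that approach, assuming the message field is what identifies a duplicate: the fingerprint is written to @metadata._id, so Elasticsearch uses it as the document _id, and a re-sent copy of the same message overwrites the existing document instead of creating a new one.

```yaml
processors:
  - fingerprint:
      fields: ["message"]            # field(s) used to compute the hash
      target_field: "@metadata._id"  # becomes the Elasticsearch document _id
```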
But this is not the issue here. The issue is that your kafka input is reading the same offset twice; this should not happen if the group_id is the same, but I'm not sure whether the issue is on the filebeat side or on the Kafka side.