I have an EFK cluster on version 6.8.2, and everything ran smoothly until I had a requirement to create a specific type of index (project-large_payload-$namespace) based on a field's length.
For this, I used a script processor in an ingest pipeline that checks the length of the field and updates the document's target index accordingly (i.e. sets ctx._index=project-large_payload-$namespace*).
The pipeline itself works fine, and I can see the new index being created with the logs whose field length exceeds 10 KB.
However, the same documents are also present in the default indices (project-$namespace-*), i.e. the docs were "copied" across rather than "moved".
Any idea how I can restrict Filebeat to push these docs only to the large_payload index?
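For reference, a pipeline along these lines can be sketched as below. The field name (`message`), the 10 KB threshold, and the `kubernetes.namespace` field are assumptions for illustration, not taken from the original post:

```
PUT _ingest/pipeline/route-large-payloads
{
  "description": "Sketch: reroute oversized docs to a large_payload index",
  "processors": [
    {
      "script": {
        "source": "if (ctx.message != null && ctx.message.length() > 10240) { ctx._index = 'project-large_payload-' + ctx['kubernetes']['namespace']; }"
      }
    }
  ]
}
```

Setting `ctx._index` inside the script processor changes where the document is written; it does not by itself emit a second copy, so duplicates usually point to the event also being indexed through another path.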
As an update, I resolved the issue by adding another processor to my pipeline, and I no longer see duplicates coming in. However, there is another source of log duplication in the environment.
When the Filebeat pods are restarted, they re-read all the log files in the configured path from the beginning (HEAD) and thus push duplicate logs to ES.
Imagine I have 100 pods running on my host machine for about two months and I restart Filebeat on that host: it will take at least one full day to first reprocess all the old logs (duplicates) and only then start tailing recent logs (already backlogged by a day at that point).
Is there any way to stop Filebeat from re-reading the log files on restart and thus avoid duplicate logging?
My filebeat-config.yml is:
Maybe this happens because the registry is removed along with the pod, so when the pod starts again it doesn't know its previous state. Maybe tail_files can help you here.
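If Filebeat runs as a DaemonSet, a common way to keep the registry across restarts is to persist Filebeat's data directory on the host via a hostPath volume, so a restarted pod resumes from the recorded file offsets instead of re-reading from HEAD. A sketch (paths follow Elastic's reference Kubernetes manifests; adjust to your deployment):

```
# Filebeat DaemonSet excerpt (sketch): keep the data directory,
# which contains the registry, on the host so restarts resume
# where the previous pod stopped.
spec:
  template:
    spec:
      containers:
        - name: filebeat
          volumeMounts:
            - name: data
              mountPath: /usr/share/filebeat/data
      volumes:
        - name: data
          hostPath:
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
```

Note that tail_files only affects how brand-new files are first read (from the end instead of the beginning); persisting the registry is what prevents re-shipping files Filebeat has already processed.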