Handling duplicate data in Logstash + Elasticsearch


(Patrik Iselind) #1

Hi,

First some background:
I have an ELK stack running in a Docker Compose environment, so far for learning purposes. Logstash gets its input from an AWS S3 bucket and sends its output to the Elasticsearch server. The processed files are not removed or moved to another bucket, and that is not something we want to change.

My problem/question:
If for whatever reason the Logstash instance crashes or dies in some other way, it would probably process all the data in the AWS S3 bucket all over again, generating duplicate entries. This is because the sincedb file of the previous Logstash instance is lost on crash/death.

Is there some elegant solution to this duplication?

Would it be sufficient to volume mount the sincedb file into the Docker container to stop the duplication?


(Patrik Iselind) #2

If volume mounting the sincedb file would be sufficient, would initializing it as an empty file be good enough to get things started? From what I've seen, the sincedb file isn't there on first start; it's created as a step in the startup.


(Magnus Bäck) #3

You should definitely store state files like sincedb persistently, either in a persistent volume or in a location mounted from the host. In the latter case I'd mount a directory rather than a single file, so you don't have to care about what happens if the file is created implicitly when you mount it into the container.
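
A minimal sketch of the directory-mount approach, in case it helps. The service name, image tag, host paths, and the container path `/usr/share/logstash/state` are all assumptions for illustration, not taken from your setup:

```yaml
# docker-compose.yml (fragment); names, tag, and paths are hypothetical
services:
  logstash:
    image: docker.elastic.co/logstash/logstash:7.17.0
    volumes:
      # Mount a directory, not a single file, so Logstash can create
      # the sincedb file itself on first start.
      - ./logstash-state:/usr/share/logstash/state
      - ./pipeline:/usr/share/logstash/pipeline:ro
```

The s3 input would then need to be pointed at a file inside that directory through its `sincedb_path` option (bucket name and region below are placeholders):

```
input {
  s3 {
    bucket       => "my-bucket"      # placeholder
    region       => "eu-west-1"      # placeholder
    # Lives in the mounted directory, so it survives container restarts.
    sincedb_path => "/usr/share/logstash/state/s3.sincedb"
  }
}
```

With this layout there is no need to pre-create an empty file; the host directory just has to be writable by the user Logstash runs as inside the container.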

