I have two pipelines that let me see (in Kibana) the most recent requests on a platform. Each request is a JSON document with a couple of fields, including:
req-id
req-timestamp
My pipelines:
The first pipeline (pipeline_temp) populates the "temp" index, which receives all documents in real time;
The second pipeline (pipeline_main) populates the "main" index. It is scheduled to run every 10 minutes and takes the "temp" index as input. For each document, it checks that there is no document with the same req-id and a greater req-timestamp, and then it empties the "temp" index.
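The intended result of the second pipeline (one document per req-id, always the newest) can be sketched in plain Python. This is only a model of the goal, not the actual Logstash configuration: the function name `merge_latest`, the `status` field, and the use of integer epoch seconds for req-timestamp are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable

Doc = Dict[str, Any]


def merge_latest(main: Dict[str, Doc], temp: Iterable[Doc]) -> Dict[str, Doc]:
    """Fold the temp documents into main, keeping only the newest per req-id."""
    for doc in temp:
        req_id = doc["req-id"]
        current = main.get(req_id)
        # Keep the incoming document only if it is strictly newer than
        # what main already holds for this req-id.
        if current is None or doc["req-timestamp"] > current["req-timestamp"]:
            main[req_id] = doc
    return main


# Hypothetical sample data: timestamps are epoch seconds (an assumption).
main_index: Dict[str, Doc] = {}
temp_batch = [
    {"req-id": "12345", "req-timestamp": 1662768000, "status": "old"},
    {"req-id": "12345", "req-timestamp": 1662854400, "status": "new"},
    {"req-id": "67890", "req-timestamp": 1662800000, "status": "only"},
]
merge_latest(main_index, temp_batch)
```

Because the model keys the "main" store by req-id, a second document with the same req-id can only replace the first one and never sits next to it. If the real "main" index lets Elasticsearch generate document IDs instead, nothing enforces that property.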
My "main" index currently holds about 12 million documents, and I see that it also contains req-ids with several different req-timestamps (not just the most recent one).
Which documents get loaded seems random: the pipeline appears to work correctly about 80% of the time and to fail about 20% of the time.
Could it be a data ingestion delay problem? Maybe the main pipeline checks whether there is a document with the same req-id and a more recent req-timestamp, but the check fails if that document has not been ingested yet.
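To make that suspicion concrete, here is a toy model of how it could happen. Elasticsearch search is near real-time: a newly indexed document becomes visible to searches only after a refresh. The sketch assumes the check is a search against the index being written to and that documents are appended rather than overwritten by ID. Both are assumptions about my setup, and `ToyIndex` and `load` are hypothetical names, not Elasticsearch or Logstash APIs.

```python
class ToyIndex:
    """Toy model: writes are buffered and become searchable only after refresh()."""

    def __init__(self):
        self.searchable = []  # documents visible to searches
        self.buffer = []      # documents indexed but not yet refreshed

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        self.searchable.extend(self.buffer)
        self.buffer.clear()

    def has_newer(self, req_id, ts):
        # Like a search, this only sees documents that were already refreshed.
        return any(
            d["req-id"] == req_id and d["req-timestamp"] > ts
            for d in self.searchable
        )


def load(index, doc):
    """Check-then-write: skip the document if a newer one is already visible."""
    if not index.has_newer(doc["req-id"], doc["req-timestamp"]):
        index.index(doc)


main = ToyIndex()
load(main, {"req-id": "12345", "req-timestamp": 2})
# No refresh has happened yet, so the newer document is invisible to the check
# and the older one is written as well.
load(main, {"req-id": "12345", "req-timestamp": 1})
main.refresh()
copies = [d for d in main.searchable if d["req-id"] == "12345"]
```

In this model both documents end up in the index, which matches the symptom of seeing several req-timestamps for one req-id. Whether it is what actually happens here depends on how the check is implemented.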
I don't know why it fails; that is the purpose of this topic.
Yes, req-id is a unique value. The main index grows every 10 minutes, when pipeline_main runs, and it now holds 12 million documents.
And no, I'm not using ILM (Elasticsearch uses the default value).
Are there any errors in /var/log/logstash/logstash-plain.log?
With the temp index, are you trying to avoid duplicate records based on unique req-ids?
If the main index has req-id=12345 with req-timestamp='10092022', and the temp index receives req-id=12345 with a newer req-timestamp='11092022', will the full record for req-id=12345 be updated in the main index, or just the req-timestamp?