Is the data being logged to Elasticsearch?
If yes, you could calculate an MD5 hash of your result and use it as the document ID. All the duplicates will then be recorded in Elasticsearch under the same document ID, and only the _version will increase.
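A minimal sketch of that idea using the Logstash fingerprint filter (field names such as `id` and `updated_at`, the host, and the index name are assumptions; adjust them to your schema):

```
filter {
  fingerprint {
    # assumed identifying fields; pick the columns that uniquely define a record
    source => ["id", "updated_at"]
    target => "[@metadata][doc_id]"
    method => "MD5"
    concatenate_sources => true
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # assumed host
    index => "my-index"                  # assumed index
    # re-indexed duplicates overwrite the same document; only _version increases
    document_id => "%{[@metadata][doc_id]}"
  }
}
```

Using `[@metadata]` keeps the hash out of the stored document itself.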
This only avoids duplicates, though.
I do not want the pipeline to run unnecessarily, keep re-indexing, and consume a lot of RAM and CPU by checking the document ID. Once it has loaded the 10,000 records, it should not load any data at all.
Hello.
What about the document_id
in the elasticsearch output settings?
If you don't specify it, each time the request runs, it creates new documents.
Give it the unique ID you have in the database; that should solve your issue.
I have 2 mn records and I am using a persistent queue.
This is what is happening with me:
page 0: 0.5 mn
page 1: 0.5 mn (1 mn)
page 2: 0.5 mn (1.5 mn)
page 3: 0.5 mn (2 mn)
page 4: 0.5 mn (2.5 mn)
page 5: 0.5 mn (3.0 mn)
...
...
...
This is going in a continuous loop. How can I avoid that? I don't have any problem using document_id, I am already aware of that solution, but how can I avoid this continuous loop?
I want it to stop after the 2 mn records are done instead of starting a second loop, and I should be able to schedule it for the next day.
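One way to get that behaviour with the jdbc input is to combine a cron `schedule` with `sql_last_value` tracking, so each run only fetches rows newer than the last run instead of re-reading all 2 mn records. A hedged sketch (connection string, driver, table, and column names are assumptions for illustration):

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"  # assumed DB
    jdbc_user => "user"
    jdbc_driver_library => "/path/to/mysql-connector.jar"         # assumed path
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # only rows with an id greater than the last recorded value are fetched,
    # so the pipeline stops once it has caught up instead of looping
    statement => "SELECT * FROM records WHERE id > :sql_last_value ORDER BY id"
    use_column_value => true
    tracking_column => "id"          # assumed monotonically increasing column
    jdbc_paging_enabled => true
    jdbc_page_size => 500000
    # cron expression: run once a day at 01:00 instead of continuously
    schedule => "0 1 * * *"
  }
}
```

Logstash persists the last seen `id` (by default in `.logstash_jdbc_last_run`), so the next scheduled run resumes from there rather than from page 0.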