The scripts that I use for entity-centric indexing [1] sort content in a source index by a common key and consolidate multiple docs into an update on a single document in the target index. The "pull" from the source index and the "push" to the target index are both done using the respective bulk APIs.
Your use case is slightly different in that you want to insert a single doc into the target index rather than update one, but you should be able to adapt the included Python script with only a few changes.
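To make the shape of that adaptation concrete, here is a minimal sketch of the pull/push skeleton using the elasticsearch-py `scan` and `streaming_bulk` helpers. The index names and cluster URL are placeholders, and emitting plain index actions (instead of the consolidated update actions the entity-centric scripts build) is the adaptation for your case:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, streaming_bulk

# Placeholder names -- substitute your own indices and cluster URL.
SOURCE_INDEX = "logs-raw"
TARGET_INDEX = "logs-clean"

es = Elasticsearch("http://localhost:9200")


def actions():
    # "Pull": scan() walks the entire source index via the scroll API.
    for hit in scan(es, index=SOURCE_INDEX, query={"query": {"match_all": {}}}):
        # "Push": emit one insert per source doc. The entity-centric
        # scripts yield update actions here instead of plain index ones.
        yield {
            "_op_type": "index",
            "_index": TARGET_INDEX,
            "_source": hit["_source"],
        }


# streaming_bulk batches the actions through the bulk API.
for ok, item in streaming_bulk(es, actions(), raise_on_error=False):
    if not ok:
        print("bulk failure:", item)
```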
Just a suggestion: come up with a way to derive a key from the data in the log entries, then use create requests to index the data into a new index under that key. If there are duplicate entries, only the first will be indexed and the duplicates will be dropped. Unfortunately, I don't know whether the reindex API supports this.
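A sketch of that suggestion, assuming timestamp plus message are the fields that define a duplicate (adjust to whatever actually identifies one in your logs): hash them into a deterministic `_id` and switch the bulk action to `create`, which is rejected with a version conflict instead of overwriting when that `_id` already exists:

```python
from hashlib import sha1


def dedup_action(hit):
    src = hit["_source"]
    # Assumption: @timestamp + message identify a duplicate entry.
    key = sha1(
        f"{src.get('@timestamp')}|{src.get('message')}".encode("utf-8")
    ).hexdigest()
    # "create" fails when the _id already exists, so only the first
    # copy of each key lands in the new index.
    return {
        "_op_type": "create",
        "_index": TARGET_INDEX,
        "_id": key,
        "_source": src,
    }
```

With the bulk helper, pass `raise_on_error=False` so the conflicts the duplicates are expected to raise get reported rather than aborting the run.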
There's no easy way to do that in place. You'd still have to walk the entire index, work out which documents are duplicates, and then delete them. It's a lot simpler to just reindex and drop the duplicates in the process.
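For what it's worth, the `_reindex` API can be bent to do exactly that: a script can rewrite `ctx._id` to a fingerprint, `dest.op_type` set to `create` drops anything already indexed under that id, and `"conflicts": "proceed"` keeps the job running past the resulting version conflicts. A sketch via the Python client, reusing the same assumed index and field names as above:

```python
# Reindex-based variant of the dedup trick: the fingerprint becomes the
# _id, "create" skips documents whose _id already exists, and
# "conflicts": "proceed" carries on past the conflicts duplicates raise.
resp = es.reindex(
    body={
        "conflicts": "proceed",
        "source": {"index": SOURCE_INDEX},
        "dest": {"index": TARGET_INDEX, "op_type": "create"},
        "script": {
            "lang": "painless",
            "source": (
                "ctx._id = ctx._source['@timestamp'] + '|' + ctx._source.message"
            ),
        },
    }
)
print(resp)
```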