Set the document_id based on the userid (either as-is, or using a fingerprint filter) and then duplicate actions for the same userid will be overwritten.
In that case use an aggregate filter to collect all the unique actions for a userid (using a Ruby set, perhaps). If the input is not sorted it would be like example 3, if it is sorted by userid then example 4. Note that this means logstash will be holding the entire input file in memory until the timeout triggers.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.