I have to keep two types of indices in Elasticsearch:
1- a full, regular index holding all records sent to it (logstash-YYYY.MM.DD)
2- a small, fast index holding only the last log message of each kind (lastaction-YYYY.MM)
I need some help setting up number 2 above.
Currently I'm using the fingerprint filter to generate a hash (SHA1) of each message field, and using this hash as the document_id for the elasticsearch output.
I'm asking because, although I don't see any duplicated records (I think this part is OK),
the index is bigger than the full/regular one.
It should not be.
Also, I expected it would only update the timestamp if a newer record was inserted,
but in fact the @timestamp field is always updated, even when the new value is a date in the past.
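To make the setup described above concrete, here is a minimal Python sketch of the same idea. It is not Logstash configuration: a plain dict stands in for the lastaction index, and `fingerprint`, `index_event` and the sample events are hypothetical names and data made up for illustration. It assumes the document ID is the SHA1 hex digest of the message text and that indexing under an existing ID replaces the whole document.

```python
import hashlib


def fingerprint(message: str) -> str:
    """Return the SHA1 hex digest of the message text (used as the document ID)."""
    return hashlib.sha1(message.encode("utf-8")).hexdigest()


def index_event(index: dict, event: dict) -> None:
    """Store the event under its fingerprint, replacing any document with that ID."""
    index[fingerprint(event["message"])] = event


# Hypothetical sample data: the second event is older than the first (a backfill).
lastaction = {}
index_event(lastaction, {"message": "user login", "@timestamp": "2015-06-02T10:00:00+00:00"})
index_event(lastaction, {"message": "user login", "@timestamp": "2015-06-01T09:00:00+00:00"})

# Only one document exists for this message, but it now carries the older timestamp.
stored = lastaction[fingerprint("user login")]
print(len(lastaction), stored["@timestamp"])
```

Under these assumptions the deduplication works, but whichever event arrives last wins, regardless of its timestamp.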
I'm asking because, although I don't see any duplicated records (I think this part is OK),
the index is bigger than the full/regular one.
It should not be.
The index can be larger than expected until it is optimized to expunge deleted documents. I don't remember to what extent this happens automatically; check the ES documentation. Keep in mind that ES treats an update as a delete followed by the indexing of a new document.
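The effect of "delete followed by a new document" on index size can be pictured with a toy model. The sketch below is only an illustration in Python, not how Lucene or ES is actually implemented; `ToySegmentStore` and its methods are hypothetical names. It assumes an append-only store in which an update marks the old copy as deleted and appends a new one, and in which space is only reclaimed by an explicit expunge step.

```python
class ToySegmentStore:
    """Append-only toy store: updates mark the old copy deleted and append a new one."""

    def __init__(self):
        # Each entry is [doc_id, document, deleted_flag].
        self.entries = []

    def index(self, doc_id, document):
        """Mark any live copy with this ID as deleted, then append the new copy."""
        for entry in self.entries:
            if entry[0] == doc_id and not entry[2]:
                entry[2] = True
        self.entries.append([doc_id, document, False])

    def live_count(self):
        """Number of documents visible to searches."""
        return sum(1 for entry in self.entries if not entry[2])

    def stored_count(self):
        """Number of copies still occupying space, deleted or not."""
        return len(self.entries)

    def expunge_deletes(self):
        """Drop copies marked as deleted, reclaiming their space."""
        self.entries = [entry for entry in self.entries if not entry[2]]


store = ToySegmentStore()
for n in range(3):
    # Three updates to the same (hypothetical) document ID.
    store.index("abc123", {"message": "user login", "seq": n})

print(store.live_count(), store.stored_count())  # 1 3
store.expunge_deletes()
print(store.live_count(), store.stored_count())  # 1 1
```

In this model an index that receives many updates for few distinct IDs holds far more stored copies than live documents until the deleted copies are expunged.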
Also, I expected it would only update the timestamp if a newer record was inserted,
but in fact the @timestamp field is always updated, even when the new value is a date in the past.
Logstash will update the document as a whole, so it seems very unlikely that it wouldn't update @timestamp too. Have you double-checked that the @timestamp field really has the correct contents when you're backfilling data?
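For comparison, the behaviour the original poster expected (replace the document only when the incoming event is newer) needs an explicit timestamp check somewhere. The following Python sketch shows that rule in isolation; `index_if_newer`, the dict standing in for the index, and the sample timestamps are all hypothetical. Elasticsearch's external versioning is one mechanism that can enforce a similar rule on the server side, though whether it fits this particular setup is an assumption.

```python
from datetime import datetime


def index_if_newer(index: dict, doc_id: str, event: dict) -> bool:
    """Replace the stored document only if the event's @timestamp is strictly newer.

    Returns True when the event was stored, False when it was skipped.
    """
    new_ts = datetime.fromisoformat(event["@timestamp"])
    current = index.get(doc_id)
    if current is not None:
        current_ts = datetime.fromisoformat(current["@timestamp"])
        if current_ts >= new_ts:
            return False  # Stored document is as new or newer; keep it.
    index[doc_id] = event
    return True


# Hypothetical sample data: a current event followed by a backfilled, older one.
lastaction = {}
first = index_if_newer(lastaction, "abc123", {"@timestamp": "2015-06-02T10:00:00+00:00"})
second = index_if_newer(lastaction, "abc123", {"@timestamp": "2015-06-01T09:00:00+00:00"})
print(first, second, lastaction["abc123"]["@timestamp"])
```

With this check in place, the backfilled event is skipped and the stored @timestamp stays at the newer value.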