I'm querying an index containing Nginx access records, running it through the Aggregate filter plugin, and writing the output to a file. However, the elasticsearch input plugin is not reading the records in @timestamp order; they are written in a random order.
Furthermore, the Logstash job has been running for over 4 days on an index that has 17 million records and 11.2 GB of data. I wonder whether it's timing out and starting over from the beginning.
Of note, I always get this error (even without the aggregate filter):
[WARN ][org.logstash.instrument.metrics.gauge.LazyDelegatingGauge][mediaserver_access_TEST] A gauge metric of an unknown type (org.jruby.RubyArray) has been create for key: cluster_uuids. This may result in invalid serialization. It is recommended to log an issue to the responsible developer/development team.
I'm running Logstash 7.4 (and reading from Elasticsearch 7.4). I've now set the following in logstash.yml and the problem is the same. It's processing 5 days' worth of log records: it gets through 5 days of records in @timestamp order (~33,000 records out of 17.1 million), then loops back to the beginning and processes another 5 days/~30,000 records, and so on.
I would create a directory called sincedb inside /etc/logstash,
then add the following to the input section of your Logstash .conf file in the conf.d folder.
This way it should only read from the last recorded point (in that .dat file you'll see a timestamp). It won't re-read the whole file, only events after that timestamp.
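A minimal sketch of what that input stanza might look like, assuming a file input reading Nginx logs (the paths here are illustrative, not from the original post):

```
input {
  file {
    # illustrative log path -- point this at your actual Nginx access log
    path => "/var/log/nginx/access.log"
    # track read position in the sincedb directory created above
    sincedb_path => "/etc/logstash/sincedb/nginx_access.dat"
  }
}
```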
Another trick for ingestion from the beginning is to also set start_position => "beginning" in the input section of your conf file,
so that it knows to start at the top and work its way down to the end. Once it's completed, change "beginning" to "end" and it will only ever start from the end of the log file. You can also mix and match start_position and sincedb in the same conf; it just depends on what your aim is. Experiment and have fun!
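Putting both options together, a sketch of a file input that does a full first pass and then tails (paths are illustrative):

```
input {
  file {
    path => "/var/log/nginx/access.log"              # illustrative path
    start_position => "beginning"                    # change to "end" after the first full read
    sincedb_path => "/etc/logstash/sincedb/nginx_access.dat"
  }
}
```

Note that start_position only applies to files Logstash has never seen before; once a sincedb entry exists, the recorded position wins.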
@Badger You mentioned that, "...you may have an elasticsearch question rather than a logstash question." Is there any way for me to test and confirm that this is an Elasticsearch issue?
The logstash-plain.log file doesn't show any issues. But it's only reading a small percentage of the index and going back to the beginning to read more records. Not sure if that's what you're asking.
If the pipeline restarted it would re-run the same query and start fetching the same set of records. That suggests to me that it is being restarted. I would expect that to get logged.
When I said you might have an elasticsearch question I meant that the query might be wrong.
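One way to test that is to put an explicit sort into the elasticsearch input's query and see whether the records come out in order. A sketch, assuming the defaults for everything else (hosts and index name are placeholders, not your actual values):

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]                      # placeholder
    index => "mediaserver_access"                    # placeholder index name
    # explicit ascending sort on @timestamp instead of the default ordering
    query => '{ "query": { "match_all": {} }, "sort": [ { "@timestamp": { "order": "asc" } } ] }'
    size => 1000
    scroll => "5m"
  }
}
```

If the output is still unordered with an explicit sort, the problem is more likely in the pipeline (e.g. multiple workers reordering events) than in the query.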