Logstash Elasticsearch Input - Query in @timestamp order

I'm querying an index of Nginx access records, running the events through the aggregate filter plugin, and writing them to a file. However, the elasticsearch input plugin is not reading the records in @timestamp order - they come out in a random order.

Furthermore, the Logstash job has been running for over 4 days on an index with 17 million records and 11.2 GB of data. I wonder if it's timing out and starting over from the beginning.

Any suggestions given the following config?

    input {
        elasticsearch {
            hosts          => ["https://m-elsc-006.on4cdn.net:9200"]
            index          => "mediaserver-2019.11.11-000020"
            user           => "elastic"
            password       => "******************"
            size           => 5000
            scroll         => "5m"
            query          => '{ "sort": [ "@timestamp" ] }'
            docinfo        => true
            docinfo_fields => [ "_type", "_id" ]
        }
    }

Of note, I always get this warning (even without the aggregate filter):

    [WARN ][org.logstash.instrument.metrics.gauge.LazyDelegatingGauge][mediaserver_access_TEST] A gauge metric of an unknown type (org.jruby.RubyArray) has been create for key: cluster_uuids. This may result in invalid serialization.  It is recommended to log an issue to the responsible developer/development team.

If pipeline.java_execution is enabled (which became the default in v7.0), then logstash will re-order events even with pipeline.workers set to 1.

If you have pipeline.workers set to 1 and java_execution disabled then you may have an elasticsearch question rather than a logstash question.

That WARN is completely normal in recent versions.

I'm running Logstash 7.4 (and reading from Elasticsearch 7.4). I've now set the following in logstash.yml and it's the same problem. :unamused: It's processing about 5 days' worth of records at a time... it gets through 5 days of records in @timestamp order (~33,000 records out of 17.1 million), then loops back to the beginning and processes another 5 days/~30,000 records, etc.

    pipeline.workers: 1
    pipeline.java_execution: false

Paul,

It sounds like you need to use a sincedb file to keep track of where Logstash is in its reading of the log files.

Read this:

https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#_tracking_of_current_position_in_watched_files

I would create a directory inside /etc/logstash called sincedb, then add the following to the input section of your Logstash .conf in the conf.d folder:

    sincedb_path => "/etc/logstash/sincedb/myfilesincedb_name.dat"

This way, it should only read from the last recorded point (the .dat file records how far into each file Logstash has read). It won't re-read the whole file, only events after that point.

Another trick, for ingesting from the beginning, is to also set this in the input section of your conf file :slight_smile:

    start_position => "beginning"

so that it knows to start at the top and work its way down to the end. Once it's completed, change "beginning" to "end" and it will only ever start from the end of the log file. You can also mix and match start_position and sincedb in the same conf; it just depends on what your aim is. Experiment and have fun!
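
For reference, here's a minimal sketch of what that looks like combined in a file input. The log path and the sincedb file name are placeholders, not from the original post:

    input {
        file {
            # Placeholder path - point this at your actual log files.
            path           => "/var/log/nginx/access.log"
            # Read the file from the top on the first run.
            start_position => "beginning"
            # Record the current read position so a restart resumes
            # instead of re-reading the whole file.
            sincedb_path   => "/etc/logstash/sincedb/myfilesincedb_name.dat"
        }
    }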

The sincedb_path option doesn't work with the elasticsearch input plugin - I get [ERROR]..."Something is wrong with your configuration."

I'm now trying it without the aggregate filter plugin (see below) - still the same problem.

    input {
        elasticsearch {
            hosts    => ["https://m-elsc-006.on4cdn.net:9200"]
            index    => "mediaserver-2019.11.11-000020"
            user     => "elastic"
            password => "**************"
            size     => 5000
            scroll   => "10m"
            query    => '{ "sort": [ "@timestamp" ] }'
            docinfo  => true
            # docinfo_fields => [ "_type", "_id" ]
        }
    }

    output {
        file {
            path  => "/var/log/logstash/2019.11.11-000020_TEST.json"
            codec => "json_lines"
        }
    }

@Badger, you mentioned that "...you may have an elasticsearch question rather than a logstash question." Is there any way for me to test and confirm that this is an Elasticsearch issue?

Is the pipeline getting restarted?

The logstash-plain.log file doesn't show any issues, but Logstash is only reading a small percentage of the index and then going back to the beginning to read more records. Not sure if that's what you're asking.

If the pipeline restarted it would re-run the same query and start fetching the same set of records. That suggests to me that it is being restarted. I would expect that to get logged.

When I said you might have an elasticsearch question I meant that the query might be wrong.
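
One way to test that is to run the same sorted query directly against the index and see whether Elasticsearch returns the hits in @timestamp order on its own. A quick sketch with curl (the host and index are taken from your config; the small size, and the -k flag for a self-signed certificate, are assumptions):

    # Prompts for the elastic user's password; returns the 5 oldest hits if the sort works.
    curl -k -u elastic \
      -H 'Content-Type: application/json' \
      "https://m-elsc-006.on4cdn.net:9200/mediaserver-2019.11.11-000020/_search?size=5" \
      -d '{ "sort": [ "@timestamp" ] }'

If the hits come back in order here, the query itself is fine and the re-ordering is happening on the Logstash side.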

My bad - yes, the pipeline is getting restarted (I presume when it's completed). I'm running it as a service; is that my problem?

These messages keep happening every hour or so.

    [INFO ][logstash.pipeline        ][mediaserver_access_TEST] Pipeline has terminated {:pipeline_id=>"mediaserver_access_TEST", :thread=>"#<Thread:0x262149ab run>"}
    [INFO ][logstash.runner          ] Logstash shut down.

If logstash restarts the elasticsearch input will run the same query again and fetch the same data.

The only way I could solve this was by running Logstash on the command line (vs. as a service). I ran it like this:

    sudo nohup /usr/share/logstash/bin/logstash --path.settings /etc/logstash/ > ~/logstash_oneshot.log &
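
That fits the symptoms: when Logstash runs as a service, the service manager restarts it after the pipeline terminates, and the elasticsearch input re-runs its query from the start. If you'd rather keep using the service for a one-shot job, one possible workaround (assuming a systemd install whose logstash unit has Restart=always; check your own unit file) is to override that:

    # Open an override file for the logstash unit...
    sudo systemctl edit logstash

    # ...and add these two lines so systemd stops restarting it:
    # [Service]
    # Restart=no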
