I'm querying an index containing Nginx access records, running it through the Aggregate filter plugin, and writing the output to a file. However, the elasticsearch input plugin is not reading the records in @timestamp order; they are written in a random order.
Furthermore, the Logstash job has been running for over 4 days on an index that has 17 million records and 11.2 GB of data. I wonder whether it's timing out and starting over from the beginning.
Of note, I always get this error (even without the aggregate filter):
[WARN ][org.logstash.instrument.metrics.gauge.LazyDelegatingGauge][mediaserver_access_TEST] A gauge metric of an unknown type (org.jruby.RubyArray) has been create for key: cluster_uuids. This may result in invalid serialization. It is recommended to log an issue to the responsible developer/development team.
I'm running Logstash 7.4 (and reading from Elasticsearch 7.4). I've now set the following in logstash.yml and the problem is the same. It's processing 5 days' worth of log records: it gets through 5 days of records in @timestamp order (~33,000 records out of 17.1 million), then loops back to the beginning and processes another 5 days/~30,000 records, and so on.
I would create a directory called sincedb inside /etc/logstash,
then add the following to the input section of your Logstash .conf file in the conf.d folder.
This way it should only read from the last recorded point (in that .dat file you'll see a timestamp). It won't re-read the whole file, only events after that timestamp.
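A minimal sketch of what that input stanza might look like, assuming a file input reading Nginx logs (the paths here are illustrative, not from the original post):

```
input {
  file {
    # illustrative log path -- point this at your actual Nginx access log
    path => "/var/log/nginx/access.log"
    # track read position in the sincedb directory created above
    sincedb_path => "/etc/logstash/sincedb/nginx_access.dat"
  }
}
```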
Another trick for ingestion from the beginning is to also set start_position => "beginning" in the input section of your conf file,
so that it knows to start at the top and work its way down to the end. Once it's completed, change "beginning" to "end" and it will only ever start from the end of the log file. You can also mix and match start_position and sincedb in the same conf; it just depends on what your aim is. Experiment and have fun!
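Putting both options together, a sketch of a file input that does a full first pass and then tails (paths are illustrative):

```
input {
  file {
    path => "/var/log/nginx/access.log"              # illustrative path
    start_position => "beginning"                    # change to "end" after the first full read
    sincedb_path => "/etc/logstash/sincedb/nginx_access.dat"
  }
}
```

Note that start_position only applies to files Logstash has never seen before; once a sincedb entry exists, the recorded position wins.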
@Badger You mentioned that, "...you may have an elasticsearch question rather than a logstash question." Is there any way for me to test and confirm that this is an Elasticsearch issue?
The logstash-plain.log file doesn't show any issues. But it's only reading a small percentage of the index and going back to the beginning to read more records. Not sure if that's what you're asking.
If the pipeline restarted it would re-run the same query and start fetching the same set of records. That suggests to me that it is being restarted. I would expect that to get logged.
When I said you might have an elasticsearch question I meant that the query might be wrong.
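One way to test that is to put an explicit sort into the elasticsearch input's query and see whether the records come out in order. A sketch, assuming the defaults for everything else (hosts and index name are placeholders, not your actual values):

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]                      # placeholder
    index => "mediaserver_access"                    # placeholder index name
    # explicit ascending sort on @timestamp instead of the default ordering
    query => '{ "query": { "match_all": {} }, "sort": [ { "@timestamp": { "order": "asc" } } ] }'
    size => 1000
    scroll => "5m"
  }
}
```

If the output is still unordered with an explicit sort, the problem is more likely in the pipeline (e.g. multiple workers reordering events) than in the query.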