Infinite loop in ES, possible bug, please comment

So I'm using Logstash to load a CSV file with 4,998,079 records in it. As it loads, I track the count in Elasticsearch, and I notice that the count goes well beyond 5M and keeps climbing.

Then I notice that the console shows the following message:

A plugin had an unrecoverable error. Will restart this plugin.

  Plugin: <LogStash::Inputs::File path=>["/home/zed/logstash-2.1.0/source_data/nga/AIS_Ships_PD-3874.csv"], sincedb_path=>"/home/zed/logstash-2.1.0/sincedb_path/AIS_Ships_PD-3874.since", type=>"core2", start_position=>"beginning", codec=><LogStash::Codecs::Plain charset=>"UTF-8">, stat_interval=>1, discover_interval=>15, sincedb_write_interval=>15, delimiter=>"\n">
  Error: No such file or directory - /home/zed/logstash-2.1.0/sincedb_path/AIS_Ships_PD-3874.since.14300.17459.522421 {:level=>:error}

Which explains why the load process restarted from the beginning of the file, and why the record count continued to climb.

So out of curiosity, I waited to see if it would halt at the end of the second load. But ES issued the same fatal error message and then restarted the data load for a third time. So I killed the Logstash process.

The problem, which is clear from the error message, is that the sincedb_path value specified a path that did not exist.

The consequence is that this sets up an infinite loop. Not good.

My thought is that ES should check the path and permissions to verify that the sincedb file can be created BEFORE starting the load.

Is there any reason why the file plugin in ES (or Logstash) does not do this?

-- Chris curzon

That's an LS error, not an ES one though?

Yes. I would agree with you. It's a logstash error.

So, do you think this is an error in the file plugin? Or is it a feature based on some kind of optimization trade-off?

An honest question. My experience suggests that when one starts working with a system as complex and configurable as LS or ES, one has to decide how much trust to place in the system defaults. What that usually means is that system defaults are not necessarily bugs, but instead are optimizations for a different use case. (So is there a use case for LS not to check the validity of the sincedb_path value?)

It's sort of like the decision in a relational database of whether or not to use an index. If you want all or most of the records in a table, you don't want the index, but if you're fetching just one or a few records, you do. And in Oracle, or any other relational database, the optimizer usually makes a decent decision about whether to use it or not.

But in LS or ES, there is no optimizer. (Is there?) So in the back of my mind is the possibility that I will--fairly easily--find some dark corner where something untoward happens.

Your thoughts?

Why does this happen? Does the file get created at all? Does the directory exist? Does the LS user have the permissions to write to this directory?

The failure was because the directory didn't exist.

The file system had "sample_data" in the path.

But the config file had "sample_date" specified as the path to use.

So the directory required by the config file did not exist.

I'm just wondering why LS did not check this fact before starting the processing.

I just did a test. In the config file I specified:

input {
  file {
    path => "/home/zed/logstash-2.1.0/source_data/gdelt/20151124.export.CSV"
    sincedb_path => "/homex/zed/logstash-2.1.0/load_conf/gdelt.since"
    type => "gdelt"
    start_position => "beginning"
  }
}
Notice I gave /homex/zed... as the sincedb_path value. This path doesn't exist.

But configtest says this configuration is OK:

$ bin/logstash -f load_conf/gdelt01.conf --configtest

Configuration OK

So I have to be very careful in this regard.
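Until the plugin validates this itself, a pre-flight check in whatever wrapper script launches Logstash can catch this kind of typo before any loading starts. A minimal sketch (the function name and messages are mine, not part of Logstash; the path is the mistyped one from my test config):

```shell
# Pre-flight sketch: verify that the directory portion of sincedb_path
# exists and is writable before launching Logstash.
check_sincedb_dir() {
  dir=$(dirname "$1")            # directory that must hold the sincedb file
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "OK: $dir"
  else
    echo "BAD: $dir"
    return 1
  fi
}

# The mistyped path from my config fails the check:
check_sincedb_dir "/homex/zed/logstash-2.1.0/load_conf/gdelt.since" \
  || echo "fix sincedb_path before starting the load"
```

This is exactly the kind of check I'd have expected --configtest to perform.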

Feel free to raise an issue on the file input repo about this; it does seem like something that should be checked.

I will do as you suggest.

As a newbie, can you tell me how? Or are there instructions for me to follow?


Sure, just create a new issue here -