Infinite loop in ES, possible bug, please comment

So I'm using Logstash to load a CSV file with 4,998,079 records in it. As it loads, I track the count in Elasticsearch, and I notice that the count goes well beyond 5M and keeps climbing.

Then I notice that the console shows the following message:

A plugin had an unrecoverable error. Will restart this plugin.

  Plugin: <LogStash::Inputs::File path=>["/home/zed/logstash-2.1.0/source_data/nga/AIS_Ships_PD-3874.csv"], sincedb_path=>"/home/zed/logstash-2.1.0/sincedb_path/AIS_Ships_PD-3874.since", type=>"core2", start_position=>"beginning", codec=><LogStash::Codecs::Plain charset=>"UTF-8">, stat_interval=>1, discover_interval=>15, sincedb_write_interval=>15, delimiter=>"\n">
  Error: No such file or directory - /home/zed/logstash-2.1.0/sincedb_path/AIS_Ships_PD-3874.since.14300.17459.522421 {:level=>:error}

Which explains why the load process restarted from the beginning of the file, and why the record count continued to climb.

So out of curiosity, I waited to see if it would halt at the end of the second load. But ES issued the same fatal error message and then restarted the data load for a third time. So I killed the Logstash process.

The problem, which is clear from the error message, is that the sincedb_path value specified a path that did not exist.

The consequence is that this sets up an infinite loop. Not good.

My thought is that ES should check the path and permissions to verify that the sincedb file can be created BEFORE starting the load.

Is there any reason why the file plugin in ES (or Logstash) does not do this?

-- Chris curzon

That's an LS error, not an ES one though?

Yes. I would agree with you. It's a logstash error.

So, do you think this is an error in the file plugin? Or is it a feature based on some kind of optimization trade-off?

An honest question. My experience suggests that when one starts working with a system as complex and configurable as LS or ES, one has to decide how much trust to place in the system defaults. What that usually means is that system defaults are not necessarily bugs, but instead are optimizations for a different use case. (So is there a use case for LS not to check the validity of the sincedb_path value?)

It's sort of like the decision in a relational database of whether or not to use an index. If you want all or most of the records in a table, you don't want the index, but if you're fetching just one or a few records, you do. And in Oracle, or any other relational database, the optimizer usually makes a decent decision about whether to use it or not.

But in LS or ES, there is no optimizer. (Is there?) So in the back of my mind is the possibility that I will--fairly easily--find some dark corner where something untoward happens.

Your thoughts?

Why does this happen? Does the file get created at all? Does the directory exist? Does the LS user have the permissions to write to this directory?

The failure was because the directory didn't exist.

The file system had "sample_data" in the path.

But the config file had "sample_date" specified as the path to use.

So the directory required by the config file did not exist.

I'm just wondering why LS did not check this fact before starting the processing.

I just did a test. In the config file I specified:

input {
  file {
    path => "/home/zed/logstash-2.1.0/source_data/gdelt/20151124.export.CSV"
    sincedb_path => "/homex/zed/logstash-2.1.0/load_conf/gdelt.since"
    type => "gdelt"
    start_position => "beginning"
  }
}
Notice I gave /homex/zed... as the sincedb_path value. This path doesn't exist.

But configtest says this configuration is OK:

$ bin/logstash -f load_conf/gdelt01.conf --configtest

Configuration OK

So I have to be very careful in this regard.
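Until the plugin validates this itself, a pre-flight check in whatever wrapper script launches Logstash can catch this kind of typo before any loading starts. A minimal sketch (the function name and messages are mine, not part of Logstash; the path is the mistyped one from my test config):

```shell
# Pre-flight sketch: verify that the directory portion of sincedb_path
# exists and is writable before launching Logstash.
check_sincedb_dir() {
  dir=$(dirname "$1")            # directory that must hold the sincedb file
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "OK: $dir"
  else
    echo "BAD: $dir"
    return 1
  fi
}

# The mistyped path from my config fails the check:
check_sincedb_dir "/homex/zed/logstash-2.1.0/load_conf/gdelt.since" \
  || echo "fix sincedb_path before starting the load"
```

This is exactly the kind of check I'd have expected --configtest to perform.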

Feel free to raise an issue on the file input repo about this; it does seem like something that should be checked.

I will do as you suggest.

As a newbie, can you tell me how? Or are there instructions for me to follow?


Sure, just create a new issue here -