I am working on large-scale data ingestion into Elasticsearch indices.
We have a number of database records that we want to ingest into ES after detecting delta changes in the database. We have .NET services that run and work out the delta changes; these deltas are written out as CSV files, each with a matching config file.
Each pair of files is then picked up and passed into Logstash using the command logstash -f myconfig.conf.
We have noticed that the initial startup time is quite long each time we call the command. Is there a way, or a setting, to streamline the startup time?
We are running a number of data changes and have started to find that Logstash itself becomes a bottleneck.
Yes, these differ with each run. Our .NET services create a unique set of CSV files with data, and each of these files is paired with a config file that varies by the type of data.
The config files also contain a bit of logic to transform the data.
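For context, the transform logic in these config files looks roughly like the following sketch (the csv filter settings and column names here are illustrative, not our actual config):

```
filter {
  csv {
    separator => ","
    columns => ["id", "name", "updated_at"]   # hypothetical columns
  }
  mutate {
    convert => { "id" => "integer" }          # coerce types before indexing
  }
}
```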
Okay, but then you should be able to run a single Logstash instance and just reload its configuration whenever you've created a new config file. Reloading the configuration is a lot faster than firing up a new JVM. You'll have to tag each event in the input so that you can pair it with the right filters and outputs.
Would you have an example of this? That would help us immensely.
Also, I have been searching the forums and the web for how to detect when the file input plugin has finished ingesting data; this would be a secondary trigger for us to verify that the data is in Elasticsearch.
Again, when I search here and on the GitHub forums, I find all sorts of suggestions but nothing that we could use.
Next time I am in your town I will definitely buy you a drink.
> Would you have an example of this? That would help us immensely.
input {
  file {
    ...
    tags => ["some unique identifier"]
  }
}

filter {
  if "some unique identifier" in [tags] {
    ...
  }
}

output {
  if "some unique identifier" in [tags] {
    ...
  }
}
> Also, I have been searching the forums and the web for how to detect when the file input plugin has finished ingesting data; this would be a secondary trigger for us to verify that the data is in Elasticsearch.
Your best bet is to read and parse the sincedb file.
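As a rough illustration (the sincedb layout is an internal detail that has changed between plugin versions, so treat this as a sketch, not an official API): each sincedb line records roughly `<inode> <major_dev> <minor_dev> <byte_offset>`, so you can compare the recorded offset against the file's size to decide whether Logstash has read the whole file.

```python
import os

def read_sincedb(path):
    """Parse a Logstash sincedb file into {inode: byte_offset}.

    Assumes the common line layout "<inode> <major> <minor> <offset>";
    newer plugin versions may append extra fields, which we ignore.
    """
    entries = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 4:
                entries[parts[0]] = int(parts[3])
    return entries

def fully_ingested(csv_path, sincedb_entries):
    # The recorded offset equals the file size once the input has read to
    # end-of-file (events may still be in flight through the pipeline).
    inode = str(os.stat(csv_path).st_ino)
    return sincedb_entries.get(inode) == os.path.getsize(csv_path)
```

Note that "read to end-of-file" is not the same as "indexed in Elasticsearch"; a count query against the index is still worth doing as a final check.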
Start a single Logstash process with logstash -f path/to/directory and add/remove files in that directory as necessary. Enable Logstash's automatic config reloading, or explicitly ask Logstash to reload the config when you've made changes (though I don't know if that works on Windows).
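Concretely, that would look something like this on the command line (the directory path is the placeholder from above; the reload flags are standard in Logstash 5.x and later):

```
# One long-lived Logstash instance over a directory of pipeline configs,
# picking up added/removed files on its own:
bin/logstash -f path/to/directory --config.reload.automatic

# Or, to trigger a one-off reload manually on Unix-like systems
# (this is the part that may not work on Windows):
kill -SIGHUP <logstash pid>
```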
Would Logstash allow us to delete a log file when it finishes processing it?
Logstash itself doesn't care if files are deleted, but Windows might not allow the deletion while Logstash has the file open. You may have to tune the file input's close_older option.
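A sketch of what that tuning might look like (the path and value are illustrative; in older file input versions close_older is a number of seconds, while newer versions also accept duration strings like "1 hour"):

```
input {
  file {
    path => "C:/ingest/*.csv"   # illustrative path
    close_older => 60           # close idle file handles after 60 s,
                                # letting Windows delete the file
  }
}
```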
I am back with another query, it seems. I tried the approach you suggested of running scans on a preconfigured folder location.
I have noticed that once a file is picked up by Logstash, we start seeing the following error messages:
[2017-11-14T22:14:09,135][WARN ][logstash.licensechecker.xpackinfo] Nil response from License Server
[2017-11-14T22:14:39,131][ERROR][logstash.licensechecker.licensemanager] Unable to retrieve license information from license server {:message=>"undefined local variable or method `bad_response_error' for #<LogStash::LicenseChecker::LicenseReader:0x7cae3ad8>", :class=>"NameError"}
[2017-11-14T22:14:39,140][WARN ][logstash.licensechecker.xpackinfo] Nil response from License Server
[2017-11-14T22:15:09,131][ERROR][logstash.licensechecker.licensemanager] Unable to retrieve license information from license server {:message=>"undefined local variable or method `bad_response_error' for #<LogStash::LicenseChecker::LicenseReader:0x7cae3ad8>", :class=>"NameError"}
[2017-11-14T22:15:09,133][WARN ][logstash.licensechecker.xpackinfo] Nil response from License Server