Data collected incorrectly from Logstash into Elasticsearch

Hello,

I have set up two processes to collect data and present it to ELK.
One process collects log data from an IIS server, and the other imports data from a daily CSV file into a new index in Elasticsearch.

The files that I load in logstash are the following:

input {
  beats {
    port => 5044
  }
}

filter {

        # ignore log comments
        if [message] =~ "^#" {
                drop {}
        }

        # check that fields match your IIS log settings
        grok {

                match => { "message" => "%{TIMESTAMP_ISO8601:log_timestamp} %{WORD:s-sitename} %{NOTSPACE:s-computername} %{IPORHOST:s-ip} %{WORD:cs-method} %{URIPATH:cs-uri-stem} (?:-|\"%{URIPATH:cs-uri-query}\") %{NUMBER:s-port} %{NOTSPACE:cs-username} %{IPORHOST:c-ip} %{NOTSPACE:cs-version} %{NOTSPACE:cs-useragent} %{NOTSPACE:cs-cookie} %{NOTSPACE:cs-referer} %{NOTSPACE:cs-host} %{NUMBER:sc-status} %{NUMBER:sc-substatus} %{NUMBER:sc-win32-status} %{NUMBER:sc-bytes} %{NUMBER:cs-bytes} %{NUMBER:time-taken}" }
        }

        # set the event timestamp from the log
        # https://www.elastic.co/guide/en/logstash/current/plugins-filters-date.html
        date {
                match => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss" ]
                timezone => "Etc/UTC"
        }

        # matches the big, long nasty useragent string to the actual browser name, version, etc
        # https://www.elastic.co/guide/en/logstash/current/plugins-filters-useragent.html
        useragent {
                source => "cs-useragent"
                prefix => "browser_"
        }

        mutate {
                remove_field => [ "log_timestamp"]
        }
}

output {
   elasticsearch {
      hosts => ["elasticsearch_server:9200"]
      index => "iis-log-%{+YYYY.MM.dd}"
   }
}

input {
    file {
        path => "/etc/logstash/conf.d/nextcloud-report-usage/*.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}

filter {
    csv {
        separator => ","
        columns => [
            "user-id",
            "date",
            "last_login_date",
            "assigned_quota",
            "used_quota",
            "number_files",
            "number_shares",
            "number_uploads",
            "number_downloads"
        ]
        remove_field => ["message"]
    }
    date {
        match => [ "date", "yyyy-MM-dd HH:mm:ss"]
        target => ["@timestamp"]
    }
    date {
        match => [ "last_login_date", "yyyy-MM-dd HH:mm:ss"]
    }
    mutate {convert => ["assigned_quota", "integer"]}
    mutate {convert => ["used_quota", "integer"]}
    mutate {convert => ["number_files", "integer"]}
    mutate {convert => ["number_shares", "integer"]}
    mutate {convert => ["number_uploads", "integer"]}
    mutate {convert => ["number_downloads", "integer"]}
}

output {
   elasticsearch {
        hosts => ["elasticsearch_server:9200"]
        index => "nextcloud-metrics-%{+YYYY.MM.dd}"
   }
}

How can I have these two processes run simultaneously? That is, the IIS log collection and a daily load of data from a CSV file.

Separately, I launched Logstash to test the data load from the CSV file and check that everything works and that the corresponding index is generated correctly, but in the log I see the following error:

0390e14f3b27abd3423a1] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"nextcloud-metrics-2022.05.04", :routing=>nil}, {"tags"=>["_dateparsefailure"], "last_login_date"=>"2022-05-02T13:01:07+00:00", "number_uploads"=>0, "used_quota"=>22869957, "log"=>{"file"=>{"path"=>"/etc/logstash/conf.d/nextcloud-report-usage/usage_report_04052022_2.csv"}}, "date"=>"2022-05-02T13:17:40+00:00", "number_files"=>36, "assigned_quota"=>26843545600, "@version"=>"1", "number_downloads"=>6, "event"=>{"original"=>"apiuser,2022-05-02T13:17:40+00:00,2022-05-02T13:01:07+00:00,26843545600,22869957,36,0,0,6"}, "@timestamp"=>2022-05-04T07:35:18.689267Z, "user-id"=>"apiuser", "number_shares"=>0, "host"=>{"name"=>"tfvs0756.medusa.gobiernodecanarias.net"}}], :response=>{"index"=>{"_index"=>"nextcloud-metrics-2022.05.04", "_id"=>"VQP-jYABi-PU5vurrNZ5", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"mapper [last_login_date] cannot be changed from type [text] to [date]"}}}}

In this process I also see that it does not load all the data from the CSV file. It should load 149 records, which is how many the file contains, but it only loads 143, and I cannot see where the problem could be.

I hope you can help me as I am starting to use this tool.

Best regards.

The important thing to realise here is that if you have these as two different config files in Logstash, when you start Logstash it'll actually merge them both into one. That means that every event that comes in via either of your inputs will be run through each filter and output section.
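If you did want to keep everything in one merged pipeline, the usual workaround is to tag each input and wrap the filter and output sections in conditionals. A rough sketch (untested, using your paths and index names):

```
input {
  beats {
    port => 5044
    tags => ["iis"]
  }
  file {
    path => "/etc/logstash/conf.d/nextcloud-report-usage/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    tags => ["nextcloud"]
  }
}

filter {
  if "iis" in [tags] {
    # grok / date / useragent filters for the IIS logs go here
  } else if "nextcloud" in [tags] {
    # csv / date / mutate filters for the CSV go here
  }
}

output {
  if "iis" in [tags] {
    elasticsearch {
      hosts => ["elasticsearch_server:9200"]
      index => "iis-log-%{+YYYY.MM.dd}"
    }
  } else if "nextcloud" in [tags] {
    elasticsearch {
      hosts => ["elasticsearch_server:9200"]
      index => "nextcloud-metrics-%{+YYYY.MM.dd}"
    }
  }
}
```

That said, separate pipelines are the cleaner option here.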

Take a look at the Multiple Pipelines page in the Logstash Reference for keeping things separated.

Thank you very much for the info.

I have configured the "pipelines.yml" file with the following configuration:

- pipeline.id: iis-log
  path.config: "/etc/logstash/conf.d/iis-log.conf"
- pipeline.id: nextcloud-usage
  path.config: "/etc/logstash/conf.d/nextcloud-usage.conf"

I restarted Logstash and saw no errors in its log, so the process executed correctly. After waiting a few minutes, I verified that the corresponding index was created, and indeed the index for the nextcloud-usage pipeline, which takes a CSV file as input, has been generated.

After a while, the Logstash log shows me some warnings like the following:

[2022-05-05T09:09:39,531][WARN ][logstash.outputs.elasticsearch][nextcloud-usage][beaaf97a5d411a0b3614355a7af34bf31a781e101910c6a563ac776d1ba07303] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"nextcloud-metrics-2022.05.05", :routing=>nil}, {"number_downloads"=>nil, "number_uploads"=>nil, "@version"=>"1", "@timestamp"=>2022-05-05T09:09:38.677781Z, "number_shares"=>nil, "used_quota"=>nil, "host"=>{"name"=>"tfvs0756.medusa.gobiernodecanarias.net"}, "event"=>{"original"=>"D6929E38-9646-4905-A110-1AFF97A35978,2022-05-02T13:17:40+00:00,2021-0702T08:54:55+00:00,26843545600,,,,,"}, "user-id"=>"D6929E38-9646-4905-A110-1AFF97A35978", "assigned_quota"=>26843545600, "date"=>"2022-05-02T13:17:40+00:00", "tags"=>["_dateparsefailure"], "last_login_date"=>"2021-0702T08:54:55+00:00", "log"=>{"file"=>{"path"=>"/etc/logstash/conf.d/nextcloud-report-usage/usage_report_05052022.csv"}}, "number_files"=>nil}], :response=>{"index"=>{"_index"=>"nextcloud-metrics-2022.05.05", "_id"=>"WA57k4ABi-PU5vurZTOO", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [last_login_date] of type [date] in document with id 'WA57k4ABi-PU5vurZTOO'. Preview of field's value: '2021-0702T08:54:55+00:00'", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"failed to parse date field [2021-0702T08:54:55+00:00] with format [strict_date_optional_time||epoch_millis]", "caused_by"=>{"type"=>"date_time_parse_exception", "reason"=>"Failed to parse with all enclosed parsers"}}}}}}

I have also verified that the CSV file it takes as input has a total of 149 records, but in the index I only see 111. It is not loading all the data from the CSV into Elasticsearch.

I carried out a manual load test of the same file using Kibana's Data Visualizer: I upload the file, follow the procedure indicated on screen, and have no problem importing all the content.

Below you can see the difference between the index created manually with the data uploaded via the Data Visualizer and the index created by the pipeline configured in Logstash:

yellow open nextcloud-usage-t1 Oc3cVTn2S4euHUDZNMeKhg 1 1 149 0 30.6kb 30.6kb
yellow open nextcloud-metrics-2022.05.05 AecqNQiATwaXCKqsfXHb4g 1 1 111 0 133.4kb 133.4kb

How can I see why the rest of the data is not loaded?

Kind regards and thanks again.

That's why.

Thanks for the answers.

Logically I saw that error, but what I don't understand is how some records were added to the corresponding index while others fail with this error, when all the records have the same "last_login_date" field.

They may all have the field, but the format is not a valid timestamp.

OK, I understand, but I don't know how to solve this problem. I have to load a CSV file with data into ELK daily.

You need to fix the timestamp.

Thanks for the reply.

Could you tell me how I can fix the timestamp? I don't know how to fix it.

Thanks!

You probably want to look at the source data and see why it's sending wrong info.

This is an example of the data that I try to load from the CSV into Logstash to view in Kibana:

user-id,date,last_login_date,assigned_quota,used_quota,number_files,number_shares,number_uploads,number_downloads
admin,2022-05-02T13:17:40+00:00,2022-05-02T13:14:51+00:00,26843545600,23203889,142,2,0,43
alexzolotov,2022-05-02T13:17:40+00:00,2022-0401T16:44:15+00:00,26843545600,22941339,44,0,40,340
apiuser,2022-05-02T13:17:40+00:00,2022-05-02T13:01:07+00:00,26843545600,22869957,36,0,0,6
guille,2022-05-02T13:17:40+00:00,2022-03-17T10:56:49+00:00,26843545600,22869956,35,0,0,0
u_38999997_pro4,2022-05-02T13:17:40+00:00,2022-0409T07:09:57+00:00,26843545600,22941339,35,0,33,4
ext-jacozam,2022-05-02T13:17:40+00:00,2022-05-02T10:48:34+00:00,26843545600,386254691,68,19,7,194
ext-ffebarm,2022-05-02T13:17:40+00:00,2022-05-02T07:44:09+00:00,26843545600,384652,185,262,10,211
ext-ddahper,2022-05-02T13:17:40+00:00,2022-0409T08:30:16+00:00,26843545600,17040873,35,0,0,0
ext-hpercej,2022-05-02T13:17:40+00:00,2022-05-02T10:58:26+00:00,26843545600,18668177,37,2,0,6

In the Logstash log, when it reads the CSV file, the warning I see is on a row such as "admin", as shown here:

[2022-05-10T11:14:26,021][WARN ][logstash.outputs.elasticsearch][nextcloud-usage][93ab60cee89ed4e28db11de1102c5e01a66e6547e64eb0bab77110933bf027c3] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"nextcloud-metrics-2022.05.10", :routing=>nil}, {"last_login_date"=>"2022-05-02T10:48:34+00:00", "log"=>{"file"=>{"path"=>"/etc/logstash/conf.d/nextcloud-report-usage/usage_report_10Mayo2022.csv"}}, "date"=>"2022-05-02T13:17:40+00:00", "number_files"=>68, "number_shares"=>19, "@timestamp"=>2022-05-10T11:14:25.307689Z, "assigned_quota"=>26843545600, "host"=>{"name"=>"tfvs0756.medusa.gobiernodecanarias.net"}, "tags"=>["_dateparsefailure"], "number_downloads"=>194, "number_uploads"=>7, "user-id"=>"admin", "used_quota"=>386254691, "event"=>{"original"=>"admin,2022-05-02T13:17:40+00:00,2022-05-02T10:48:34+00:00,26843545600,386254691,68,19,7,194"}, "@version"=>"1"}], :response=>{"index"=>{"_index"=>"nextcloud-metrics-2022.05.10", "_id"=>"ZjOtrYABi-PU5vurbePW", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"mapper [date] cannot be changed from type [text] to [date]"}}}}

I don't know why some rows load without error while others produce warnings, so not all the data ends up in the index.

Thanks!

I have not tested this in Elasticsearch, so this is speculation. If the first document that is indexed has the [date] field set to "date" (i.e. the CSV header row arrives first), then unless you have a template defined, the [date] field type will be set to "text". If a later document arrives with a [date] field that looks like a timestamp, then Elasticsearch might (this is the speculation bit) use date detection to try to make the [date] field a "date", resulting in that mapping exception.

Since Logstash does not (by default) maintain the order of events, I would expect this to be intermittent. The solution would be to drop the header or create a template to set the field type.
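As an untested sketch of the filter-side fixes: dropping the header row keeps the literal strings "date" and "last_login_date" from ever reaching Elasticsearch, and note that the _dateparsefailure tags are expected with your current config, since the file carries ISO8601 values while the date filters match "yyyy-MM-dd HH:mm:ss". The header check and the gsub regex below are assumptions based on the sample data you posted:

```
filter {
  # drop the CSV header row so the literal column names never get indexed
  # and lock the [date] mapping to "text" (assumes the header always
  # starts with "user-id,date,")
  if [message] =~ /^user-id,date,/ {
    drop {}
  }

  # ... your csv filter here ...

  # the file carries ISO8601 timestamps (e.g. 2022-05-02T13:17:40+00:00),
  # so match on ISO8601 rather than "yyyy-MM-dd HH:mm:ss"
  date {
    match => [ "date", "ISO8601" ]
  }

  # some rows carry values like "2021-0702T08:54:55+00:00" with a missing
  # hyphen; this gsub (assuming that is the only defect) repairs them
  # before the date filter for last_login_date runs
  mutate {
    gsub => [ "last_login_date", "^(\d{4})-(\d{2})(\d{2})T", "\1-\2-\3T" ]
  }
}
```

Alternatively, an index template that maps [date] and [last_login_date] as "date" (with the formats actually present in the file) would pin the mapping regardless of the order in which events arrive.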