Logstash inconsistency while reading csv data

Hello,
We are using Logstash to parse csv data and load it into PostgreSQL; after applying the proper transformations, we move that data to Elasticsearch using the same Logstash. We don't have any problem transferring data from PostgreSQL to Elasticsearch; however, while reading the csv, some inconsistent behaviour occurs.

I have a csv file with roughly 65,000 lines, encoded as us-ascii. Logstash parses this file on a daily basis and stores the data in PostgreSQL, which has the same model (columns) as the csv. Usually Logstash reads the data without any problem, but on some days it reads far fewer rows and stores only those in PostgreSQL, without raising any error. As I said, the csv file has about 65k lines (the exact count changes from day to day), yet on those days Logstash reads only around 20-30 of the 65k rows without reporting any error.

I changed the logging level to DEBUG but couldn't find anything. As a workaround, after deleting the sincedb file, Logstash reads all the data successfully and stores it in the DB.
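
For reference, the same effect can be achieved purely in configuration by pointing sincedb_path at /dev/null; this is only a sketch of the relevant input lines, and not something we run in production, because the whole file would also be re-read after every pipeline restart:

input {
    file {
        path => "/app/data/sampleData*.csv"
        start_position => "beginning"
        # /dev/null means no read position is persisted, so the file is
        # always re-read from the start (including after restarts)
        sincedb_path => "/dev/null"
    }
}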

This happens randomly; there is no specific pattern to the occurrences.

Example of sampleData.csv:

COLUMN1,COLUMN2,COLUMN3
DATA1,DATA2,DATA3
sampleFile.conf

input {
    file {
        path => "/app/data/sampleData*.csv"
        start_position => "beginning"
        add_field => { "[@metadata][appname]" => "sampleData" }
        sincedb_path => "/var/lib/logstash/data/last/.sampleData"
        codec => plain {
            charset => "ISO-8859-1"
        }
    }
}

filter {
    if [@metadata][appname] == "sampleData" {
      csv {
        separator => ";"
        columns => ["column1","column2","column3"]
      }
      if [column1] == "COLUMN1" { drop {} }
      mutate {
        remove_field => ["@version", "@timestamp", "message", "path", "host"]
      }
    }
}

output {
    if [@metadata][appname] == "sampleData"
    {
        jdbc {
            connection_string => "jdbc:postgresql://X.X.X.X:5432/sampleData"
            username => "username"
            password => "password"
            max_pool_size => "3"
            statement => [ 'INSERT INTO schema.sampleData ("column1","column2","column3") VALUES (?, ?, ?)', "column1","column2","column3"]
        }
    }
}

Things I've tried so far:

  1. Changing the encoding option in sampleFile.conf (see the sketch after this list).
  2. Removing the codec block from sampleFile.conf.
  3. Checking whether the csv file contains any special characters or malformed data.
  4. Checking the EOL marker of the csv file.
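
For item 1, the change only touches the codec line; the exact charset values tried are not shown in the conf above, but it looked roughly like this (UTF-8 here is only an example value):

input {
    file {
        path => "/app/data/sampleData*.csv"
        codec => plain {
            # example charset only; the source file itself is reported as us-ascii
            charset => "UTF-8"
        }
    }
}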

I haven't found a proper solution yet. I wonder if anyone has come across the same situation? Could you please share any advice?

Thanks

Can you add another output writing to a file and compare the results in the file with the results in the database, to check whether it is really Logstash that has the issue and not the output database?
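
Something along these lines, for example (the path and the line format are only placeholders, adapt them to your pipeline):

output {
    if [@metadata][appname] == "sampleData" {
        # extra debug output: one line per parsed event
        file {
            path => "/tmp/sampleData_debug.log"
            codec => line { format => "%{column1};%{column2};%{column3}" }
        }
    }
}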

Also, do you have anything in the logs? Can you look at the logs and check for any WARN/ERROR entries? You can disable the DEBUG level, as it is mostly just noise and makes it harder to troubleshoot anything.

Is your source file completely written when Logstash starts to read it, or is it still being updated while Logstash is reading it?

What does the data look like? Can you share some real sample data? Do you have line breaks inside the columns?

CSV and file outputs were added to the relevant Logstash conf for observation. Since each file is processed once a day, I can only check it the next day. I will get back to you regarding this issue.
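
Roughly what was added alongside the jdbc output (the paths here are illustrative):

output {
    if [@metadata][appname] == "sampleData" {
        # debug copies of every event, for comparison with the database
        csv {
            path => "/tmp/sampleData_check.csv"
            fields => ["column1","column2","column3"]
        }
        file {
            path => "/tmp/sampleData_check.json"
        }
    }
}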

There is nothing specific in the log files. I have 4 distributed instances with nearly 40 different configurations (pipelines); DEBUG is set on only 1 instance. Other servers have the same inconsistency problem and their log level is not DEBUG, but there is no clue about the error there either. The data is sensitive for our project, so I cannot share details, but the logs contain only some specific warnings about encoding, which we are already aware of and are taking action to fix. However, there are no errors.

The data is collected from different kinds of sources and saved as csv files on a local disk. A tool grabs that data and lands it on our server over an SFTP connection.

I can't share the data, but it is generally simple plain text with at most 12-13 columns and varying EOLs: some files have Linux line endings, some have Windows-style line endings.

As a quick preview, I can only share the following screenshots:

I've tried this option and saw that if the input file has 1k rows and Logstash parses it without any error, then the csv output has the same row count. If Logstash doesn't parse the csv file correctly, it writes only the rows it could parse correctly. In other words, if the csv file has 1k rows and Logstash parses 100 of them, then the csv output file contains those same 100 rows. The error logs show that Logstash parses the event line starting from a position other than the beginning of the line.

For example, if the csv has a line like this:

"Ankara , 123-HOP - 123456 , 10.10.10.10 , some text input"

Logstash parses it starting from the middle of the line, like this:

"123-HOP - 123456 , 10.10.10.10 , some text input" or

"10.10.10 , some text input"

Interestingly, however, if I delete or rename the sincedb file, Logstash parses the whole csv file without any error.

Is /app, the path where your csv files are stored, a network share?

It's a network-shared (NFS) folder mounted on our production servers.

Now it makes some sense; this is probably the issue.

The file input does not work well reading data from network shares; unfortunately, there is not much you can do.

Sorry for the late reply, I have been dealing with some other issues as well. Although my teammates moved the raw data (which is parsed by Logstash) from NFS to local disk, the problem still persists. Do you have any alternative/advice?

It should not be the same issue.

Are the files being written directly to the local disk, or are they constantly being moved from NFS to the local disk?
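
If they are copied in once per day and not appended to afterwards, one thing that might be worth trying (just a sketch, assuming a reasonably recent file input plugin that supports read mode) is consuming each daily file once in read mode instead of tailing it:

input {
    file {
        path => "/app/data/sampleData*.csv"
        mode => "read"                  # read each matching file once, from the beginning
        file_completed_action => "log"  # keep the file and just record that it was processed
        file_completed_log_path => "/var/lib/logstash/data/last/sampleData_completed.log"
        sincedb_path => "/var/lib/logstash/data/last/.sampleData"
        add_field => { "[@metadata][appname]" => "sampleData" }
        codec => plain {
            charset => "ISO-8859-1"
        }
    }
}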
