Data loading in real time using logstash csv input


(Sabyasachi Mallick) #1

I am fetching data using a Python script and dumping it into a CSV file, then using that CSV file as the input in my Logstash config file. Each time my Python code runs, it opens and overwrites the whole CSV file. But sometimes I am not getting all the data in Elasticsearch, and I am not able to figure out the problem.
Is it because of Logstash's sincedb?
Or is there a better way to update everything in real time?


(João Duarte) #2

If you're using Logstash as a batch importer, the file input plugin can cause trouble, since it was made for streaming the contents of files.

I suggest using something like the stdin input plugin instead. This way, once the Python code finishes running, you can run cat file.csv | bin/logstash -f config.txt. In this situation, Logstash will terminate once it finishes processing the file.
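If the Python script should start the import itself, the pipe can be reproduced with the standard library. This is a minimal sketch, not a tested deployment: the helper name pipe_file_to_command is made up for illustration, and bin/logstash and config.txt are simply the paths from the command above.

```python
import subprocess


def pipe_file_to_command(csv_path, command):
    """Run `command` with the contents of csv_path on its stdin.

    Returns the exit code of the command. This is the equivalent of
    `cat csv_path | command`, without going through a shell.
    """
    with open(csv_path, "rb") as csv_file:
        completed = subprocess.run(command, stdin=csv_file)
    return completed.returncode


# Hypothetical usage, to be called only after the CSV is fully written
# and closed (paths are the ones from the post; adjust to your install):
#   pipe_file_to_command("file.csv", ["bin/logstash", "-f", "config.txt"])
```

Because stdin reaches end-of-file when the CSV has been consumed, the call returns once the command exits, so the script knows the batch is done before it fetches the next one.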


(Guy Boertje) #3

Alternatively, in your Python script, you could give each file a different name, using a UUID for example, and delete all files that do not have that UUID as their name. Make sure you do the delete after the new file is written, to minimize the chances of the same inode being reused.
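A minimal sketch of that write-then-delete order, assuming all CSV files live in one directory that the file input watches with a glob. The function name write_fresh_csv and the .csv suffix are illustrative choices, not something from the original script.

```python
import csv
import os
import uuid


def write_fresh_csv(directory, rows):
    """Write rows to a new UUID-named CSV, then delete the older CSV files."""
    new_name = f"{uuid.uuid4()}.csv"
    new_path = os.path.join(directory, new_name)
    with open(new_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    # Delete only after the new file exists, so the new file cannot be
    # given an inode that was freed by removing one of the old files.
    for name in os.listdir(directory):
        if name.endswith(".csv") and name != new_name:
            os.remove(os.path.join(directory, name))
    return new_path
```

This only lowers the chance of inode reuse for the file being written now; a later file can still receive an inode freed by an earlier delete.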

The reason you get only some of the data, rather than all of it, is that while LS is running it keeps track of the number of bytes processed per inode, in memory and periodically on disk. If it detects a file size change when you overwrite the file contents, it does one of three things: 1) if the size is less than the previous size, it sees this as wholly new content and rereads the file; 2) if the size is the same, it does nothing; 3) if the size is more than the previous size, it reads the bytes from the previous size point up to the new size (this is the "tailing" behaviour).
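The three cases can be written down as a small model. This is only an illustration of the behaviour described above, not the plugin's actual code, and the function name bytes_to_read is hypothetical.

```python
def bytes_to_read(previous_position, current_size):
    """Model which byte range gets read after a file size change.

    Returns a (start, end) tuple, or None when nothing is read.
    """
    if current_size < previous_position:
        # Smaller file: treated as wholly new content, reread from 0.
        return (0, current_size)
    if current_size == previous_position:
        # Same size: no change detected, nothing is read.
        return None
    # Larger file: only the bytes past the old position are read.
    return (previous_position, current_size)
```

For example, if the previous CSV was 1000 bytes and the overwritten one is 1500 bytes, the model returns (1000, 1500): the first 1000 bytes of the new content are never read, which matches the symptom of missing rows.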


(Sabyasachi Mallick) #4

Thanks @guyboertje! I will try your solution.
If I set the sincedb path to null, then I think Logstash will not keep track of previous file information, so when the file gets updated it will fetch all the data from that file by using start_position => beginning. This is my understanding. Is it right?


(Guy Boertje) #5

No.

When you set the sincedb path to /dev/null - you are telling the file input that you do not want the in-memory map of inodes -> position written to a file.

Normally people do want LS to load these positions from file when LS restarts. You should want that as well because, with /dev/null, if LS restarts it will reread the file from the beginning.

Logstash will always use the in-memory map to track inodes vs positions.


(Sabyasachi Mallick) #6

I wrote code to generate a new UUID every time, and after writing data to a new CSV file I delete the old files.
It worked perfectly for 2 days, but I am facing the same problem again. Logstash loaded very little data yesterday, and I am not able to figure out what the problem is. If the inode is being reused, how can I avoid that completely?


(Guy Boertje) #7

How many files are you writing? One per day, 10 per hour?

I suggest that you use a different input.

Perhaps you could use python to write to Redis and use the Redis input to read the CSV strings.
For a more robust solution migrate to Kafka if necessary.

The Logstash file input was not designed to work for your use case.

Where/what is your python script getting the CSV data from?


(Sabyasachi Mallick) #8

I am writing one file every 4 hours, but each file contains around 45000 rows.
I feel that even though I am giving the files different names, at some point an inode is still being reused.
Maybe that is because, when storage is getting full, I free space by deleting the files.
To avoid this problem, every 3 to 4 weeks I stop Logstash, delete the sincedb file, delete all the files, and then restart Logstash. But this is not a good solution in the long run.
So is there any way I can delete files without Logstash running into previously used inodes?

I will try your suggestions to check which will be better for my requirements.

Thanks


(Guy Boertje) #9

In a word - no, the file input was not designed for this.

It's up to the file system how it recycles inodes. Because of log file rotation, the file input was designed to remember the inode and not the file path/name - the assumption being that if one is tailing a file and it is rotated in some way, then the inode still points to the content that was read.

There is a fixed set of inodes created when the disk partition is made. When they are used up, then the disk is full. When a file is deleted its inode is freed to be used again.

Logstash, while it's running, cannot detect that the logical usage of an inode has changed.
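To see which inode a file currently has, os.stat can be used from the Python side. The sketch below is only a diagnostic aid under the assumption of a POSIX-style file system; the helper name inode_of is made up for illustration.

```python
import os


def inode_of(path):
    """Return the inode number the file system currently assigns to path."""
    return os.stat(path).st_ino


def overwrite_in_place(path, text):
    """Overwrite path the way open(path, "w") does: truncate, then write.

    The file keeps its inode, so a reader that tracks inodes sees the
    same file with a different size, not a new file.
    """
    with open(path, "w") as handle:
        handle.write(text)
    return inode_of(path)
```

Logging inode_of for each newly written CSV, next to the inodes recorded in the sincedb file, would show whether a new file has been handed an inode that Logstash still holds a position for.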


(Guy Boertje) #10

There is a possible exploit of the file input internals.

If you switch back to using one file name (scrapping the UUID filename) AND you simulate file rotation - it might work.
A record in the sincedb file is <inode (int int int)><space><bytes_read>

  1. When you are ready to overwrite the file content, read the sincedb file and compare the last number (bytes read) to the bytes you wrote 4 hours ago
  2. Wait while bytes_read < bytes_wrote (Logstash is processing lines)
  3. Truncate the old content in the file.
  4. Re-read the sincedb file and wait while the bytes_read is not zero.
  5. Add the next lot of lines to the file.

For this to work, set sincedb_write_interval to 1 second. See https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-sincedb_write_interval
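The five steps could be sketched in Python roughly as follows. This is an untested outline under several assumptions: the sincedb record has the layout given above (inode first, bytes read as the fourth field), the CSV already exists, and the function names, polling interval, and timeout are all hypothetical.

```python
import os
import time


def read_bytes_read(sincedb_path, inode):
    """Return the recorded read position for inode, or None if absent."""
    try:
        with open(sincedb_path) as handle:
            for line in handle:
                fields = line.split()
                # Assumed layout: inode, two device ints, bytes read.
                if len(fields) >= 4 and fields[0] == str(inode):
                    return int(fields[3])
    except FileNotFoundError:
        pass
    return None


def wait_until(condition, poll_seconds=1.0, timeout_seconds=600.0):
    """Poll condition() until it is true; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_seconds
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("sincedb did not reach the expected position")
        time.sleep(poll_seconds)


def rotate_and_write(csv_path, sincedb_path, new_lines, bytes_written_last_time):
    """Simulate rotation on one fixed file name; return the bytes written."""
    inode = os.stat(csv_path).st_ino

    def position():
        return read_bytes_read(sincedb_path, inode)

    # Steps 1-2: wait until Logstash has read everything written last time.
    wait_until(lambda: (position() or 0) >= bytes_written_last_time)
    # Step 3: truncate the old content in place (same inode).
    with open(csv_path, "r+b") as handle:
        handle.truncate(0)
    # Step 4: wait until the sincedb shows the position reset to zero.
    wait_until(lambda: position() == 0)
    # Step 5: add the next lot of lines.
    data = "".join(new_lines).encode("utf-8")
    with open(csv_path, "ab") as handle:
        handle.write(data)
    return len(data)
```

The return value is meant to be stored by the caller and passed back as bytes_written_last_time on the next run, four hours later.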

Good Luck.

