File input not updating

Hi,

I've got a small issue with updating an index from a CSV. The PowerShell script I've written to convert an XML file into a CSV works by copying the XML file to local storage, converting it and writing the new CSV to a separate folder. I then use the file input in Logstash to pull in the CSV, mutate it and push it to its index in ES. All works as expected except for one small issue: when the scheduled PS script runs again, it overwrites the CSV with the latest copy. Logstash isn't seeing this as a change, so it isn't updating the index. Not sure how I can get around this. Deleting the original CSV before writing the new one doesn't seem to make sense, as it would have the same file name. Config file shown below.

input {
  file {
    type => "audit"
    path => "C:/Target_Directory/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  if [type] == "audit" {
    csv {
      separator => "|"
      columns => [ "Provider","EventID","EventName","Version","Source","Level","Opcode","Keywords","Result","TimeCreated","Correlation","Channel","Computer","ComputerUUID","Security","ProviderGuid","SubjectIP","SubjectUnix","SubjectUserSid","SubjectUserIsLocal","SubjectDomainName","SubjectUserName","ObjectServer","ObjectType","HandleID","ObjectName","AccessList","AccessMask","DesiredAccess","Attributes","OldDirHandle","NewDirHandle","OldPath","NewPath","InformationSet" ]
    }

    # Drop the first line of the import (the header row)
    if [Computer] == "Computer" {
      drop { }
    }

    # Drop successful audits (we only care about failures)
    if [Result] == "Audit Success" {
      drop { }
    }

    # Drop any entries for PCs, i.e. names ending in "$" (we only care about usernames)
    if [SubjectUserName] =~ /\$$/ {
      drop { }
    }

    date {
      match => [ "TimeCreated", "dd/MM/yyyy HH:mm:ss" ]
      target => "@timestamp"
      timezone => "Europe/London"
    }

    mutate {
      remove_field => [ "Provider","EventID","Version","Source","Level","Opcode","Keywords","Correlation","Channel","ComputerUUID","Security","ProviderGuid","SubjectUnix","SubjectUserSid","SubjectUserIsLocal","ObjectServer","HandleID","AccessList","AccessMask","DesiredAccess","Attributes","OldDirHandle","NewDirHandle","OldPath","NewPath","InformationSet" ]
      remove_field => [ "message" ]
      gsub => [ "Result", "Audit Success", "Success",
                "Result", "Audit Failure", "Fail" ]
      split => { "Computer" => "/" }
      add_field => { "SVM_Name" => "%{[Computer][1]}" }
    }
  }
}
output {
  if [type] == "audit" {
    elasticsearch {
      index => "audit-%{+YYYY.MM.dd}"
      hosts => ["elk.server.com:9200"]
      #user => elastic
      #password => changeme
    }
  }
}

Any suggestions as to why it fails to see the change? I'm assuming it has something to do with the file being replaced rather than amended, because other CSV imports I have set up in a similar fashion work fine. The only difference is that on the others, the CSV gets amended, not replaced.

Thanks.

The file input tracks content by inode on *nix and by a kernel32 call that gets file info on NTFS, but it discovers files initially by path. IMPORTANT: it also assumes a "tailing" use case.

I don't know what that kernel32 call returns on Windows when a file (path) is deleted and recreated with new content; I'm not sure whether it is a new identifier or the same one as before.

When you say it overwrites the CSV with the latest copy, do you mean that the script opens the file, truncates it and writes new content, or does it use an OS-level overwrite?

My advice: you are deeply in uncharted waters here, and there are not many people with specific experience of what you are trying to do. You should attempt to mimic the "tailing" scenario as much as possible:

  • Append to the CSV file.
  • Use a real file for sincedb tracking (see the sketch after this list).
  • Use log rotation (difficult depending on Windows version) or a stop LS, truncate file, start LS mechanism daily/weekly, depending on volumes.
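
For the sincedb part, something along these lines in the input block would do it. This is just a sketch; the sincedb path is a made-up example, any writable location on the Logstash host will work:

input {
  file {
    type => "audit"
    path => "C:/Target_Directory/*.csv"
    start_position => "beginning"
    # Persist read positions across restarts instead of discarding them.
    # Example path only - pick any writable file on the Logstash host.
    sincedb_path => "C:/logstash/sincedb_audit"
  }
}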

Good Luck.

Thanks for the reply.

Based on your feedback, I'd say that Windows creates a new identifier, as I don't have this issue with a similar setup where I amend the CSV. So, second question: as I can't amend the CSV (because it would grow to a very large size in a short space of time), if I were to drop each copy of the exported CSV into the folder with a new name and then delete the older one, Logstash would take the contents of the new file and add it to the index.

Does that sound like a practical (albeit long-winded) solution? As I'm importing *.csv from said folder, I'm assuming Logstash would automatically see the additional file dropped into the folder and then import it?

Yes, I think adding a new file each time will work. Make sure you use close_older to ensure that the file is closed after a time (say 2 hours, in seconds); tailing mode means the file is held open because more content is expected, and otherwise deleting the "done" file(s) will result in an error on Windows.

Also, for this proposed solution, consider stat_interval vs discover_interval. In this scenario we are trying to mimic a full file read in an assumed tailing operation. Each file, when discovered, is put into a "queue". Files in the queue are read in full initially (to EOF), then looped over to see if they have grown or shrunk. This loop sleeps for stat_interval seconds before looping around, and in each loop it increments a counter; when that counter reaches the discover_interval value, a discover operation is performed and the counter is reset.

What this means for you is that if you start with one file, it will be completely read before the sleep occurs, and then the sleep/stat loop continues until discover is called. Depending on how often the PS script is scheduled to run, you could set stat_interval to 2 or 5 seconds and discover_interval to 1 (i.e. discover in each loop) to make new file detection more responsive. However, if the CSV file is very big and it takes LS many seconds to process all the lines and get them into ES, then it does not make sense to sleep in the loop for very long; you could set stat_interval to 0.1, so that discover is done pretty much at the end of each file read.
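
As a rough sketch only (the exact numbers are assumptions based on an hourly export, and the sincedb path is a made-up example), the input could look like:

input {
  file {
    type => "audit"
    path => "C:/Target_Directory/*.csv"
    start_position => "beginning"
    sincedb_path => "C:/logstash/sincedb_audit"   # example path only
    close_older => 7200         # close a file not read from for 2 hours (in seconds)
    stat_interval => 5          # re-check known files for growth every 5 seconds
    discover_interval => 1      # run path discovery on every stat loop
  }
}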

Thanks for the additional info, really helpful.

I've not had a chance to look at this for a few days and am only getting to it now. So, just to check I've understood this correctly, I have 2 more questions:

  1. As my PS script will delete the existing CSV file from the folder LS is watching and add a new CSV file with a new random file name on every write, would I actually need to specify the intervals? Wouldn't the new file be picked up anyway, given that close_older defaults to closing the input file after 1 hour, so once the PS script deletes it, LS will itself stop watching that file after 1 hour (even though technically it's not there anymore)?

  2. If I go with the interval settings on the file input: my PS script runs once an hour, on the hour, so I'm thinking that a stat_interval of 300 (5 mins) and a discover_interval of 1 should be good enough for my needs. I don't have to get this data into ES super quickly, as the end result for this data is merely a Kibana dashboard for info purposes.

Thoughts?

Thanks.

So, to update in case this helps anyone else: I've found the simplest method was to leave stat_interval and discover_interval at their default settings and change close_older to 20 mins. I've altered my PS script to add the time to the output file name, meaning it always outputs a differently named CSV each time it runs. It also deletes the existing ones in the folder, and Logstash now releases the hold (on a file that no longer exists) 20 mins after last touching it.
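
Roughly, the only change to the input block was close_older (sketch below, assuming everything else stays as in my original post):

input {
  file {
    type => "audit"
    path => "C:/Target_Directory/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # 20 minutes in seconds; after this Logstash closes the handle on a file
    # it hasn't read from, so the deleted CSVs no longer hold anything open.
    close_older => 1200
  }
}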

Thanks for all your input.

