I have a few million JSON files that I have separated into multiple subfolders. Basically these are CloudTrail data for a year. My Logstash version is 5.5 and this is an 8 core, 32 GB system. The problem is that when I run Logstash with a file input plugin and an Elasticsearch output, it runs for a while and then dies with a Java heap space error. Can someone please help me with this? I am running out of ideas.
The file input isn't built to process filename patterns that expand to millions of files. You'll have to process them in smaller numbers, e.g. by writing a small script that reads the millions of files and copies the data to a small(er) set of files that you point Logstash to, as in the sketch below. Another option could be to send the files to Logstash over a socket or a broker. A broker like RabbitMQ will help you with backpressure if Logstash isn't able to consume the messages fast enough.
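Something along these lines would work as the "small script" above. This is only a rough sketch: the paths, the batch size, and the assumption that each file is a single JSON document (optionally with a CloudTrail "Records" array) are mine, so adjust them to your own layout.

```python
#!/usr/bin/env python3
# Rough sketch: concatenate millions of small JSON files into a much smaller
# set of newline-delimited batch files that Logstash's file input can tail.
# SRC_DIR, DST_DIR and BATCH_SIZE are assumptions -- adjust to your layout.
import json
from pathlib import Path

SRC_DIR = Path("/data/cloudtrail")          # root of the per-day subfolders (assumed)
DST_DIR = Path("/data/cloudtrail-batches")  # where the batch files go (assumed)
BATCH_SIZE = 50_000                         # events per output file (assumed)

DST_DIR.mkdir(parents=True, exist_ok=True)

batch_no = 0
count = 0
out = open(DST_DIR / f"batch-{batch_no:05d}.ndjson", "w")

for path in SRC_DIR.rglob("*.json"):
    with path.open() as f:
        doc = json.load(f)
    # CloudTrail log files wrap the events in a "Records" array;
    # fall back to the whole document if that key is absent.
    for record in doc.get("Records", [doc]):
        out.write(json.dumps(record) + "\n")
        count += 1
        if count >= BATCH_SIZE:
            out.close()
            batch_no += 1
            count = 0
            out = open(DST_DIR / f"batch-{batch_no:05d}.ndjson", "w")

out.close()
```

You'd then point the file input at /data/cloudtrail-batches/*.ndjson with a json codec, which keeps the number of files Logstash has to track small.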
Can you recommend a solution for *.gz files coming in from Amazon CloudTrail? The archives contain JSON files, and those JSON files don't have any newlines in them. Is there a way I can ingest the files without having to unpack the archives and add a newline to every JSON document? I am having a hard time processing this data.
If these files are among the million files, you'll probably have to process them outside of Logstash anyway, so I don't know if it's such a big problem. Not sure what you mean by "doesn't have a new line in it". Are the files lacking a trailing newline or what?
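If you do end up unpacking them outside of Logstash, something like this rough sketch is all it takes to turn each archive into newline-delimited JSON that the file input can read with a json codec. Again, the paths and the assumption that each archive holds one JSON document with a "Records" array are mine, not something from your setup.

```python
#!/usr/bin/env python3
# Rough sketch: unpack CloudTrail *.json.gz archives and rewrite each entry
# of the "Records" array as one JSON object per line (NDJSON).
# SRC_DIR and DST_DIR are assumptions -- adjust to where your files land.
import gzip
import json
from pathlib import Path

SRC_DIR = Path("/data/cloudtrail-gz")      # incoming *.gz files (assumed)
DST_DIR = Path("/data/cloudtrail-ndjson")  # newline-delimited output (assumed)
DST_DIR.mkdir(parents=True, exist_ok=True)

for gz_path in SRC_DIR.rglob("*.json.gz"):
    with gzip.open(gz_path, "rt") as f:
        doc = json.load(f)
    # foo_CloudTrail_....json.gz -> foo_CloudTrail_....jsonl
    out_path = DST_DIR / (gz_path.stem + "l")
    with out_path.open("w") as out:
        for record in doc.get("Records", []):
            out.write(json.dumps(record) + "\n")
```

Combined with the batching approach above, this keeps the heavy lifting out of Logstash entirely.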