I'm currently trying to ingest 100 GB of CSV files into Elasticsearch through Logstash, and it's taking forever. I've narrowed the columns I'm filtering for down to 8 out of 71, but it still takes a long time to ingest them. Is there a faster setup I could use? I have my batch size set to 10000, workers set to 8, and this is the .conf file I'm using:
Thanks! My team at work has sort of made me the de facto ELK guy, so I've been trying to put something together that would let us upload a lot of files to search through. I'd say the current speed is about 5 MB/min to 1 GB/hr. Right now that amounts to roughly one file every 5 minutes, and since we have hundreds of files to ingest, it just isn't fast enough yet.
I haven't tried removing stdout yet since I've been using it to gauge the progress of the upload and as a quick confirmation that files are ingesting. Do you really think it could be slowing things down that much? I just feel like I must be doing something wrong for it to be ingesting this slowly, and I assume there's a setting (or settings) I can change to optimize it.
100% it is; you're echoing every event out to the console. If you really want to keep track of where things are at, it might be easier to write the output to a file and then use Linux tools to view that instead.
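For example, something along these lines; the hosts, index name, and file path are just placeholders, not your actual config:

```
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # point at your cluster
    index => "csv-ingest"                # placeholder index name
  }
  # optional: write events to a file you can tail/grep instead of stdout
  file {
    path => "/var/log/logstash/ingest-progress.log"
  }
}
```

Dropping the stdout output alone should help, since serializing and printing every event to the console is pure overhead.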
What is the mapping for the index? How many shards does it have? How large is your Elasticsearch cluster?
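You can check those quickly with the cat and mapping APIs; the index name below is just a placeholder:

```
curl -s "http://localhost:9200/_cat/nodes?v"
curl -s "http://localhost:9200/_cat/shards/csv-ingest?v"
curl -s "http://localhost:9200/csv-ingest/_mapping?pretty"
```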
Ah, that makes a lot of sense; I'll go ahead and try removing the stdout. Also, it's being run locally on a VM with Windows as the OS. The plan is to move to a clustered model with 2 VMs running Elasticsearch, 1 running Kibana, and another running Logstash, but we're low on storage space and manpower at the moment.
In regards to mapping, I'm still new to ELK and don't think I've made any changes unless defining CSV as the filter type counts, so I assume it's just the default mapping, and I'm only running one shard at the moment. I haven't configured anything beyond trying to run multiple pipelines and adjusting the batch_size, workers, and heap size. Also, with the lack of available VMs, everything is being run locally on a single VM on the E: drive.
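For reference, the settings I've been adjusting are roughly along these lines (a sketch with example values, not my exact files):

```
# pipelines.yml
- pipeline.id: csv-ingest
  path.config: "E:/logstash/conf/csv.conf"
  pipeline.workers: 8
  pipeline.batch.size: 10000
```

Heap I've been changing separately through the -Xms/-Xmx lines in jvm.options.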
Apologies, to clarify: there are 71 fields in total per CSV, and we narrowed it down to about 8 that we need to filter for. From what I read, completely removing the other fields during ingestion would also slow things down, so I've just been using file > csv > columns.
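So the filter section is structured roughly like this (field names are placeholders, not our real columns):

```
filter {
  csv {
    separator => ","
    # names are assigned by position, so this labels the first 8 columns;
    # any remaining columns still come through auto-named column9, column10, ...
    columns => ["timestamp", "src_ip", "dst_ip", "action", "user", "host", "port", "status"]
  }
}
```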
No, they're just the 8 most important columns we're trying to grab. Thank you for the information though, as that was something I had sort of noticed while ingesting but didn't really know how to describe: while the columns I was filtering for were getting named, the rest of the columns were still being ingested, just without defined names.
Originally I was using the autodetect_column_names setting to just ingest everything with a name, but I was told by some peers that it would be faster to filter for only the desired columns. Is that actually true, or would it be the same regardless of filtering?
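For comparison, the earlier version of the csv filter looked roughly like this (again, just a sketch):

```
filter {
  csv {
    separator => ","
    # take the field names from each file's header row
    autodetect_column_names => true
  }
}
```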