Fastest way to ingest CSVs with Logstash into Elasticsearch

Security_Check · May 10, 2023, 7:26pm

I'm currently trying to ingest 100 GB of CSV files into Elasticsearch through Logstash. The issue is that it's taking forever. I have narrowed down the columns I'm trying to filter for to 8 out of 71, but it still takes a long time to ingest them. Is there a faster setup I could use? I have my batch size set to 10000 and workers set to 8, and this is the .conf file I'm using:

input {
  file {
    path => "C:/sampledata2/*.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    columns => [
      "Action",
      "Receive Time",
      "Source address",
      "Source Port",
      "Destination address",
      "Destination Port",
      "Bytes Received",
      "Bytes Sent"
    ]
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "appentest3"
    user => "elastic"
    password => ""
  }
  stdout { codec => rubydebug }
}

Any help would be greatly appreciated.

warkolm · May 10, 2023, 11:59pm

Welcome to our community!

How long is it taking? Have you tried removing the stdout section? That's likely to be slowing things down.

Security_Check · May 11, 2023, 2:07am

Thanks! My team at work has more or less made me the de facto ELK guy, so I've been trying to put something together that would let us upload a lot of files to search through. I would say the current speed is about 5 MB/min to 1 GB/hr. That currently amounts to about 1 file every 5 minutes or so, and since we have hundreds of files to ingest, it just isn't fast enough yet.

I haven't tried removing stdout yet, since I have been using it to gauge the current progress of the upload and as a quick confirmation that files are ingesting. Do you really think it could be slowing things down that much? I just feel like I've got to be doing something wrong for it to be ingesting so slowly, and I assume there must be a setting (or settings) I can change to optimize it.

I can provide further details/settings if needed.

warkolm · May 11, 2023, 2:11am

100% it is; you're echoing everything out to the console. If you really want to keep track of where things are at, it might be easier to put the output into a file and then use Linux tools to view that instead.
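As a rough sketch of that suggestion, the output section from the original post could drop the `stdout` block and write a lightweight progress line to a file instead. The log path and the one-timestamp-per-event format are assumptions chosen for illustration, not something from the original setup:

```conf
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "appentest3"
    user => "elastic"
    password => ""
  }
  # Hypothetical progress log: one short line per event instead of a full
  # rubydebug dump on the console. The path is an assumption; keep it
  # outside the directory matched by the file input.
  file {
    path => "C:/logstash-progress/ingested.log"
    codec => line { format => "%{@timestamp}" }
  }
}
```

Once ingestion is confirmed to be working, the `file` block can be removed as well, so the only output left is Elasticsearch.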

What is the mapping for the index? How many shards does it have? How large is your Elasticsearch cluster?

stephenb · May 11, 2023, 2:31am

@Security_Check
Are you saying your file only has 8 columns, or does it still have 71?

The dissect filter is often far faster than the csv filter; if you have 8 columns, that should fly.

The other considerations are what Warkolm said.

And yes, the stdout output will slow things down significantly... just watch the number of docs in Elastic.
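A minimal sketch of what a dissect-based filter might look like, assuming a file that really does contain only these 8 columns, in this order, with no quoted values that contain commas (dissect splits purely on the delimiters in the pattern). The snake_case field names are hypothetical stand-ins for the column names in the original config:

```conf
filter {
  dissect {
    # Each %{name} captures the text up to the next literal comma.
    # Field names and column order are assumptions for illustration.
    mapping => {
      "message" => "%{action},%{receive_time},%{source_address},%{source_port},%{destination_address},%{destination_port},%{bytes_received},%{bytes_sent}"
    }
  }
}
```

For a wider file, a column that is not needed can be consumed with an empty skip field, `%{}`, in the matching position of the pattern, but every column still has to be accounted for in order.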

Security_Check · May 11, 2023, 3:23am

Ah, that makes a lot of sense. I'll go ahead and try removing the stdout. Also, it's being run locally on a VM with Windows as the OS. The plan is to move to a clustered model with 2 VMs running Elasticsearch, 1 running Kibana, and another running Logstash, but we're low on storage space and manpower at the moment.

In regards to mapping, I'm still new to ELK and don't think I've made any changes, unless defining CSV as the filter type counts; otherwise I'd assume it's just the default mapping. I'm only running one shard at the moment. I haven't configured anything beyond trying to run multiple pipelines and adjusting batch_size, workers, and heap size. Also, given the lack of available VMs, everything is being run locally on a single VM, on the E drive.

Security_Check · May 11, 2023, 3:26am

Apologies, to clarify: there are 71 fields in total per CSV, and we narrowed it down to about 8 that we need to filter for. From what I read, completely removing the fields from ingestion would also slow down ingestion, so I've just been using File > CSV > columns.

Badger · May 11, 2023, 2:57pm

Are those 8 columns the first 8 columns? There is no way to ask a csv filter to parse a subset of the columns in a row.
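To illustrate the point, here is a small sketch of how the `columns` list lines up with the file. The csv filter assigns names by position, so every column up to the last one you want has to be named in file order, and unwanted ones can be dropped afterwards. The 5-column layout and the `unused_*` placeholder names are hypothetical; the real files have 71 columns whose positions are not given in this thread:

```conf
filter {
  csv {
    separator => ","
    # Names are assigned by position, so placeholders are needed for
    # columns that are not wanted. This layout is hypothetical.
    columns => ["unused_1", "Action", "unused_3", "Receive Time", "unused_5"]
    # Applied only when parsing succeeds: drop the placeholder fields.
    # Removing "message" (the raw line) is optional.
    remove_field => ["unused_1", "unused_3", "unused_5", "message"]
  }
}
```

Columns beyond the end of the `columns` list are not skipped; the filter gives them default names such as `column6`, which is why unlisted columns still show up in the indexed documents.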

Security_Check · May 11, 2023, 3:19pm

No, they're just the 8 most important columns we are trying to grab. Thank you for the information, though, as that was something I had sort of noticed while ingesting but didn't really know how to describe: the columns I was filtering for were getting named, while the rest were still being ingested but without names.

Originally I was using the autodetect_column_names setting to just ingest everything with a name, but I was told by some peers that it would be faster to filter for only the desired columns. Is that true, though, or would it be the same regardless of filtering?

system · June 8, 2023, 3:20pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
