Taking too much time to index


(Shubham Munot) #1

So I'm trying to index a database table with 91 columns and around 300,000 rows. It is in a CSV file and I'm using Logstash to load it into Elasticsearch.

It's been running for 20 hours and it still hasn't been indexed.
For test purposes I had taken just 10 rows and indexed them; that worked fine.

I ran Logstash in debug mode:

10:00:34.221 [Ruby-0-Thread-11: /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:532] DEBUG logstash.pipeline - Pushing flush onto pipeline
10:00:34.965 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:35.966 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:36.967 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:37.968 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:38.969 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:39.220 [Ruby-0-Thread-11: /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:532] DEBUG logstash.pipeline - Pushing flush onto pipeline
10:00:39.970 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:40.972 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:41.973 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:42.975 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:43.976 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:43.977 [[main]<file] DEBUG logstash.inputs.file - _globbed_files: /home/patagonia/Documents/testserver-patients-May31.csv: glob is: ["/home/patagonia/Documents/testserver-patients-May31.csv"]

This has been repeating for 20 hours, and I have no clue how much has been indexed so far.
There is no new folder in /var/lib/elasticsearch/nodes/0/indices apart from the indices that already show up in localhost:9200/_cat/indices.
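One way to check how far indexing has got is Elasticsearch's _cat/count API. A minimal sketch, assuming ES is on localhost:9200 with no authentication and the index is named patient as in the config later in this thread (the function is a hypothetical helper; it returns None instead of raising if ES is unreachable):

```python
# Sketch: ask Elasticsearch how many documents the "patient" index holds.
# Assumes ES on localhost:9200 with security disabled, as in this thread.
import json
import urllib.request

def indexed_doc_count(host="http://localhost:9200", index="patient"):
    """Return the document count for `index`, or None if ES is unreachable."""
    url = f"{host}/_cat/count/{index}?format=json"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            rows = json.loads(resp.read())
            return int(rows[0]["count"])
    except (OSError, ValueError, LookupError):
        # Connection refused/timeout, bad JSON, or index missing
        return None

print(indexed_doc_count())
```

From a shell, `curl 'localhost:9200/_cat/count/patient?v'` gives the same number; running it periodically shows whether the count is still growing.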


(Christian Dahlqvist) #2

What does your Logstash config look like?


(Shubham Munot) #3

input {
  file {
    path => "/home/patagonia/Documents/testserver-patients*.csv"
  }
}
filter {
  csv {
    columns => ["PatientID", "PaperChartNumber", "PMRecorndNumber", "SubscriberID", "PracticeID", "UserID", "PatientType", "StartDate", "EndDate", "IsActive", "IsDeleted", "IsReportable", "Title", "FirstName", "MiddleName", "LastName", "Suffix", "PreferredName", "GuardianName", "MaidenName", "DateOfBirth", "DateOfDeath", "Gender", "MaritalStatus", "RaceCode", "LanguageCode", "InsuranceType", "PharmacyName", "PreferredContact", "InactiveReason", "BloodType", "AddressLine1", "AddressLine2", "AddressLine3", "City", "State", "Zipcode", "HomePhoneNumber", "WorkPhoneNumber", "MobileNumber", "EmailAddress", "PatientPhotoLocation", "MergedPatientID", "InsertedBy", "InsertDate", "LastEditedBy", "LastEditDate", "EthnicGroup", "ReferringPhysician", "ReferringPhysicianPhone", "ReferringPhysicianFax", "PharmacyPhone", "PharmacyFax", "ReferringPhysicianCity", "PharmacyCity", "PCPName", "PCPPhone", "PCPFax", "PCPCity", "AltPhone1", "AltPhone2", "InsuranceID", "PMPatientID", "PMCaseName", "PatInsProfileID", "PatientInsuranceID", "PatientInsuranceName", "Employer", "Employment", "LocationID", "Comments", "PatientState", "PatientStateDate", "County", "RefPhysicianID", "CNDSID", "SSN", "Nosnailmail", "NeedsInterpreter", "CountryCode", "Veteranstatus", "WCServedIn", "Driverlicense", "Cl"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "patient"
  }
}

(Christian Dahlqvist) #4

What does resource utilisation, particularly CPU and disk I/O, look like on the host while indexing? How many CPU cores do you have available?


(Shubham Munot) #5

How do I check all of that?

I cannot see all this information using the top command in the terminal.
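On Linux (which this host appears to be, given the /home and /var paths), core count, load average, and raw disk I/O counters can be read without extra tools. A minimal sketch; the device name "sda" is an assumption and may differ on your machine:

```python
# Sketch: read CPU core count, load average, and disk write counters on Linux.
import os

def cpu_cores():
    """Number of CPU cores visible to the OS."""
    return os.cpu_count()

def load_average():
    """1-, 5- and 15-minute load averages, as shown by top/uptime."""
    return os.getloadavg()

def disk_sectors_written(device="sda"):  # "sda" is an assumed device name
    """Total sectors written to `device`, from /proc/diskstats (7th stat field).
    Sample this twice and take the difference to estimate write throughput."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])
    return None  # device not found

print(cpu_cores(), load_average())
```

Interactively, `top` (press 1 to see per-core usage) covers CPU, and `iostat -x 1` from the sysstat package is the usual tool for per-device disk I/O.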


(Shubham Munot) #6

Hey,
I figured out the solution.
The Logstash file input works much like a Beats harvester: it keeps a sincedb file recording how far into each file it has already read, so it treated the CSV as already processed and wouldn't index the same file again. I had to create a new file with the same content to get it indexed.
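An alternative to copying the file is to tell the file input to re-read it from the start. A sketch of the input block from earlier in the thread with two real file-input options added; note that `sincedb_path => "/dev/null"` disables position tracking entirely, so it is only sensible for one-off bulk imports:

```
input {
  file {
    path => "/home/patagonia/Documents/testserver-patients*.csv"
    # Read existing files from the top instead of tailing new lines
    start_position => "beginning"
    # Discard the sincedb position cache so the file is never skipped
    # as "already read" (use only for one-off imports)
    sincedb_path => "/dev/null"
  }
}
```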


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.