Taking too much time to index


(Shubham Munot) #1

So I'm trying to index a database table with 91 columns and around 300,000 rows. It is in a CSV file and I'm using Logstash to load it into Elasticsearch.

It's been running for 20 hours and it still hasn't been indexed.
For test purposes I had taken just 10 rows and indexed them; that worked fine.

I ran Logstash in debug mode:

10:00:34.221 [Ruby-0-Thread-11: /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:532] DEBUG logstash.pipeline - Pushing flush onto pipeline
10:00:34.965 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:35.966 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:36.967 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:37.968 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:38.969 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:39.220 [Ruby-0-Thread-11: /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:532] DEBUG logstash.pipeline - Pushing flush onto pipeline
10:00:39.970 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:40.972 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:41.973 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:42.975 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:43.976 [[main]<file] DEBUG logstash.inputs.file - each: file grew: /home/patagonia/Documents/testserver-patients-May31.csv: old size 0, new size 356553232
10:00:43.977 [[main]<file] DEBUG logstash.inputs.file - _globbed_files: /home/patagonia/Documents/testserver-patients-May31.csv: glob is: ["/home/patagonia/Documents/testserver-patients-May31.csv"]

This has been repeating for 20 hours, and I have no clue how much has been indexed so far.
There is no new folder in /var/lib/elasticsearch/nodes/0/indices apart from the indices that already show up in localhost:9200/_cat/indices.
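One way to check how far indexing has got is Elasticsearch's _cat/count API. A minimal sketch, assuming ES is on localhost:9200 with no authentication and the index is named patient as in the config later in this thread (the function is a hypothetical helper; it returns None instead of raising if ES is unreachable):

```python
# Sketch: ask Elasticsearch how many documents the "patient" index holds.
# Assumes ES on localhost:9200 with security disabled, as in this thread.
import json
import urllib.request

def indexed_doc_count(host="http://localhost:9200", index="patient"):
    """Return the document count for `index`, or None if ES is unreachable."""
    url = f"{host}/_cat/count/{index}?format=json"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            rows = json.loads(resp.read())
            return int(rows[0]["count"])
    except (OSError, ValueError, LookupError):
        # Connection refused/timeout, bad JSON, or index missing
        return None

print(indexed_doc_count())
```

From a shell, `curl 'localhost:9200/_cat/count/patient?v'` gives the same number; running it periodically shows whether the count is still growing.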


(Christian Dahlqvist) #2

What does your Logstash config look like?


(Shubham Munot) #3

input {
  file {
    path => "/home/patagonia/Documents/testserver-patients*.csv"
  }
}
filter {
  csv {
    columns => ["PatientID", "PaperChartNumber", "PMRecorndNumber", "SubscriberID", "PracticeID", "UserID", "PatientType", "StartDate", "EndDate", "IsActive", "IsDeleted", "IsReportable", "Title", "FirstName", "MiddleName", "LastName", "Suffix", "PreferredName", "GuardianName", "MaidenName", "DateOfBirth", "DateOfDeath", "Gender", "MaritalStatus", "RaceCode", "LanguageCode", "InsuranceType", "PharmacyName", "PreferredContact", "InactiveReason", "BloodType", "AddressLine1", "AddressLine2", "AddressLine3", "City", "State", "Zipcode", "HomePhoneNumber", "WorkPhoneNumber", "MobileNumber", "EmailAddress", "PatientPhotoLocation", "MergedPatientID", "InsertedBy", "InsertDate", "LastEditedBy", "LastEditDate", "EthnicGroup", "ReferringPhysician", "ReferringPhysicianPhone", "ReferringPhysicianFax", "PharmacyPhone", "PharmacyFax", "ReferringPhysicianCity", "PharmacyCity", "PCPName", "PCPPhone", "PCPFax", "PCPCity", "AltPhone1", "AltPhone2", "InsuranceID", "PMPatientID", "PMCaseName", "PatInsProfileID", "PatientInsuranceID", "PatientInsuranceName", "Employer", "Employment", "LocationID", "Comments", "PatientState", "PatientStateDate", "County", "RefPhysicianID", "CNDSID", "SSN", "Nosnailmail", "NeedsInterpreter", "CountryCode", "Veteranstatus", "WCServedIn", "Driverlicense", "Cl"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "patient"
  }
}

(Christian Dahlqvist) #4

What does resource utilisation, particularly CPU and disk I/O, look like on the host while indexing? How many CPU cores do you have available?


(Shubham Munot) #5

How do I check all of that?

I cannot see all this information using the top command in the terminal.
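On Linux (which this host appears to be, given the /home and /var paths), core count, load average, and raw disk I/O counters can be read without extra tools. A minimal sketch; the device name "sda" is an assumption and may differ on your machine:

```python
# Sketch: read CPU core count, load average, and disk write counters on Linux.
import os

def cpu_cores():
    """Number of CPU cores visible to the OS."""
    return os.cpu_count()

def load_average():
    """1-, 5- and 15-minute load averages, as shown by top/uptime."""
    return os.getloadavg()

def disk_sectors_written(device="sda"):  # "sda" is an assumed device name
    """Total sectors written to `device`, from /proc/diskstats (7th stat field).
    Sample this twice and take the difference to estimate write throughput."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])
    return None  # device not found

print(cpu_cores(), load_average())
```

Interactively, `top` (press 1 to see per-core usage) covers CPU, and `iostat -x 1` from the sysstat package is the usual tool for per-device disk I/O.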


(Shubham Munot) #6

Hey,
I figured out the solution.
The Logstash file input works much like a Beats harvester: it keeps a sincedb file recording how far into each file it has already read, so it treated the CSV as already processed and wouldn't index the same file again. I had to create a new file with the same content to get it indexed.
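An alternative to copying the file is to tell the file input to re-read it from the start. A sketch of the input block from earlier in the thread with two real file-input options added; note that `sincedb_path => "/dev/null"` disables position tracking entirely, so it is only sensible for one-off bulk imports:

```
input {
  file {
    path => "/home/patagonia/Documents/testserver-patients*.csv"
    # Read existing files from the top instead of tailing new lines
    start_position => "beginning"
    # Discard the sincedb position cache so the file is never skipped
    # as "already read" (use only for one-off imports)
    sincedb_path => "/dev/null"
  }
}
```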


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.