CSV parsing taking too long using the Python API

Hello! This is my first post here in the Elastic discussion area, sorry for any mistakes.

I'm fairly new to the ELK Stack. I'm trying to load a .CSV file into Elasticsearch using the Python API. The problem is that it's taking way too long to index just a few logs (312 seconds for 30,000 logs). I'm using Elasticsearch 5.6.3 running on Ubuntu 16.04 with 6 GB RAM.
The main idea is to convert each row into a JSON document and then index it into Elasticsearch. The code:

import time
import json
import pandas as pd
from elasticsearch import Elasticsearch

class Storage(object):
    
    def __init__(self,user,password):
        
        self.user = user
        self.password = password
        self.es = Elasticsearch(http_auth=(user, password))
        
    def get_info(self, log=False):
        
        info = self.es.info()
        if info:
            print(json.dumps(info, indent=3))
            return info
    
    def index(self, index, doc_type, _id, body):
        # Index a single document; each call is one HTTP request
        status = self.es.index(index=index, doc_type=doc_type, id=_id, body=body)
        return status

        
if __name__ == '__main__':
    
    st = Storage('elastic','changeme')
    print(st.get_info())
    df = pd.read_csv(file_to_parse, low_memory=False)  # file_to_parse: path to the CSV file
    aux = df.to_dict('records')                        # one dict per CSV row
    
    index = 1
    begin = time.time()
    for register in aux:
        reg = json.dumps(register)                      # serialize the row to JSON
        res = st.index('someindex', 'log', index, reg)  # one indexing request per document
        index += 1
        
    print("Finished", time.time() - begin)

What could possibly be improved here?

Use the bulk API to send multiple documents per indexing request (docs). This is much more efficient than indexing each document individually.
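
As a minimal sketch of what that could look like with the elasticsearch-py client, here is the per-row loop replaced by the helpers.bulk helper. The index name 'someindex', the type 'log', and the file_to_parse variable are taken from your snippet; chunk_size=1000 is just an illustrative value:

import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(http_auth=('elastic', 'changeme'))
df = pd.read_csv(file_to_parse, low_memory=False)

def actions(records):
    # Yield one action per CSV row; helpers.bulk groups them into
    # batched _bulk requests instead of one HTTP round trip per document.
    for i, record in enumerate(records, start=1):
        yield {
            "_index": "someindex",
            "_type": "log",
            "_id": i,
            "_source": record,
        }

success, errors = bulk(es, actions(df.to_dict('records')), chunk_size=1000)
print("Indexed", success, "documents")

You can tune chunk_size to trade memory per request against the number of requests, and if a single connection still isn't fast enough, the parallel_bulk helper in elasticsearch.helpers is another option.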

