Read and index multiple CSV files

Hi Community,

I am working on a task to read multiple CSV files from a path and index each of them into Elasticsearch. All the CSV files are independent of each other.
I am using Python, but I would appreciate any alternate methods as well.

example:

csv files:

  • test1.csv
  • test2.csv
  • test3.csv
  • .........

indices:

  • test1
  • test2
  • test3
  • .......

Below is my code. It works for a single file, but not for multiple CSV files.
The issue is at the helpers.bulk statement: when I comment that statement out, the loop runs through all the CSV files correctly.

Also, because of null values I see the error below. That part is fine by itself; I am okay with the nulls being dropped for now. But is this error really what breaks indexing for all the subsequent CSVs?

'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': {'type': 'json_parse_exception', 'reason': "Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@69daf1f8; line: 1, column: 33]"}}
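For context, here is a minimal reproduction of that error, independent of Elasticsearch: pandas stores empty CSV cells as the float NaN, and Python's json module serializes it as the bare token `NaN`, which is not valid JSON and is exactly what the Elasticsearch parser is rejecting:

```python
import json
import math

doc = {"price": math.nan}           # what a row with an empty cell becomes
print(json.dumps(doc))              # -> {"price": NaN}  (non-standard token)
print(json.dumps({"price": None}))  # -> {"price": null} (valid JSON)
```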
# import Elasticsearch modules
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import pandas as pd
import glob

# read csv files
path = "C:/Users/shanuma3/Desktop/CSV/"
#file = "bigmart_data1.csv"
files = glob.glob(path + "*.csv")
rows = 10000

# create connection
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

for file in files:
    df = pd.read_csv(file, nrows=rows)
    index_name = file.split("\\")[1][:-4]   # e.g. "...CSV\test1.csv" -> "test1"
    documents = df.to_dict(orient='records')
    print("Index created: " + index_name)
    es.indices.create(index=index_name)
    print("Indexing start: " + index_name)
    #print(documents)
    helpers.bulk(es, documents, index=index_name, doc_type='_doc', raise_on_error=True)
    print("Index finished: " + index_name)
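One fragile spot, separate from the NaN issue: `file.split("\\")[1][:-4]` assumes exactly one backslash in each path, which only holds here because glob mixes the forward-slash prefix with a single Windows backslash before the filename. A more portable sketch (the helper name `index_name_for` is just for illustration; the `.lower()` is there because Elasticsearch index names must be lowercase):

```python
import os

def index_name_for(csv_path):
    # ".../CSV/test1.csv" -> "test1", regardless of how many separators the path has
    return os.path.splitext(os.path.basename(csv_path))[0].lower()

print(index_name_for("C:/Users/shanuma3/Desktop/CSV/test1.csv"))  # -> test1
```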

Hi All,

I think I figured out the solution. Since I am using pandas, I am converting the NaN values to None (which serializes to JSON null) with the line below.

df = df.where(pd.notnull(df), None)

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.