Read and index multiple CSV files

Hi Community,

I am working on a task to read multiple CSV files from a path and index each of them into Elasticsearch. All the CSV files are independent of each other.
I am using Python, but I would appreciate any alternate methods as well.

example:

csv files:

  • test1.csv
  • test2.csv
  • test3.csv
  • .........

indices:

  • test1
  • test2
  • test3
  • .......

Below is my code. It works for a single file, but not for multiple CSV files.
The issue is at the helpers.bulk statement: when I comment that statement out, the loop runs through all the CSV files correctly.

Also, because of null values I see the error below. That part is fine by itself; I am okay with the nulls being dropped for now. But is this error really what breaks indexing for all the subsequent CSVs?

'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': {'type': 'json_parse_exception', 'reason': "Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@69daf1f8; line: 1, column: 33]"}}
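For context, here is a minimal reproduction of that error, independent of Elasticsearch: pandas stores empty CSV cells as the float NaN, and Python's json module serializes it as the bare token `NaN`, which is not valid JSON and is exactly what the Elasticsearch parser is rejecting:

```python
import json
import math

doc = {"price": math.nan}           # what a row with an empty cell becomes
print(json.dumps(doc))              # -> {"price": NaN}  (non-standard token)
print(json.dumps({"price": None}))  # -> {"price": null} (valid JSON)
```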
# import Elasticsearch modules
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import pandas as pd
import glob

# read csv files
path = "C:/Users/shanuma3/Desktop/CSV/"
#file = "bigmart_data1.csv"
files = glob.glob(path + "*.csv")
rows = 10000

# create connection
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

for file in files:
    df = pd.read_csv(file, nrows=rows)
    index_name = file.split("\\")[1][:-4]   # e.g. "...CSV\test1.csv" -> "test1"
    documents = df.to_dict(orient='records')
    print("Index created: " + index_name)
    es.indices.create(index=index_name)
    print("Indexing start: " + index_name)
    #print(documents)
    helpers.bulk(es, documents, index=index_name, doc_type='_doc', raise_on_error=True)
    print("Index finished: " + index_name)
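One fragile spot, separate from the NaN issue: `file.split("\\")[1][:-4]` assumes exactly one backslash in each path, which only holds here because glob mixes the forward-slash prefix with a single Windows backslash before the filename. A more portable sketch (the helper name `index_name_for` is just for illustration; the `.lower()` is there because Elasticsearch index names must be lowercase):

```python
import os

def index_name_for(csv_path):
    # ".../CSV/test1.csv" -> "test1", regardless of how many separators the path has
    return os.path.splitext(os.path.basename(csv_path))[0].lower()

print(index_name_for("C:/Users/shanuma3/Desktop/CSV/test1.csv"))  # -> test1
```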

Hi All,

I think I figured out the solution. Since I am using pandas, I am converting the NaN values to None (which serializes to JSON null) with the line below.

df = df.where(pd.notnull(df), None)

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.