Hi Community,
I am working on a task to read multiple CSV files from a path and index each one into its own Elasticsearch index. All the CSV files are independent of each other.
I am using Python. I would also appreciate suggestions for any alternate methods.
Example:
csv files:
- test1.csv
- test2.csv
- test3.csv
- .........
indices:
- test1
- test2
- test3
- .......
Below is my code. It works for a single file but fails when looping over multiple CSV files.
The issue is at the helpers.bulk statement: when I comment that statement out, the loop runs through all the CSV files as expected.
Also, because of null values I see the error below. For now I am fine with those documents being rejected, but is this error really what breaks indexing for all the following CSVs?
'error': {
  'type': 'mapper_parsing_exception',
  'reason': 'failed to parse',
  'caused_by': {
    'type': 'json_parse_exception',
    'reason': "Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@69daf1f8; line: 1, column: 33]"
  }
}
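To work around it, I am considering converting the NaN values to None before building the documents, so the client serializes them as JSON null instead of the bare NaN token. A minimal sketch, reusing the df and documents variables from my code below (I am not sure this is the recommended fix):

# replace NaN with None so the bulk helper emits JSON null instead of NaN
df = df.where(pd.notnull(df), None)
documents = df.to_dict(orient='records')

Would that be the right approach here?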
# import the Elasticsearch client and bulk helpers
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import pandas as pd
import glob
import os

# collect all CSV files in the folder
path = "C:/Users/shanuma3/Desktop/CSV/"
files = glob.glob(path + "*.csv")
rows = 10000  # number of rows to read from each file

# create the connection
es = Elasticsearch([{'host':'localhost','port':9200}])
for file in files:
    df = pd.read_csv(file, nrows=rows)
    # derive the index name from the file name, e.g. test1.csv -> test1
    index_name = os.path.splitext(os.path.basename(file))[0]
    documents = df.to_dict(orient='records')
    es.indices.create(index=index_name)
    print("Index created: " + index_name)
    print("Indexing start: " + index_name)
    helpers.bulk(es, documents, index=index_name, doc_type='_doc', raise_on_error=True)
    print("Indexing finished: " + index_name)