BulkIndexError using Elasticsearch in Python

I am writing a program to search through really large (>400 MB) CSV files provided by the government. Some of these files have over 1,000,000 rows. This is a local program that roughly 5 people in my company will use to help them do their jobs better. I am using Python (x64) and have tried both the native csv module and Pandas to read the files. Both methods produce the same result. I can read a small test CSV file perfectly fine and index its rows into the ES index. But when I attempt to index the large CSV file, I get BulkIndexErrors and nothing is indexed. What would be the proper way to index a large CSV file into Elasticsearch using Python?

with open(editContract.get()) as f:
    csv_data = csv.DictReader(f, dialect='excel')
    for row in csv_data:
        helpers.bulk(es, row, index="contract_search", doc_type='_doc')

raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
elasticsearch.helpers.errors.BulkIndexError: ('45 document(s) failed to index...'reason': 'Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes'

Look at this issue; most of the time the problem turned out to be the format of the data.

Also, you don't have to call bulk one row at a time. I build the data in a loop:

es_out.append(dict(es_row))

Then send it all.

spf_index = bulk(client, es_out)

Of course, within a reasonable number of rows :)
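
To make the batching idea above a bit more concrete, here is a minimal sketch, assuming the elasticsearch-py helpers module and the index name from the question. The function names csv_actions and index_csv and the chunk_size of 1000 are illustrative choices of mine, not something taken from the posts above, so treat it as a starting point rather than the definitive way to do it.

```python
import csv


def csv_actions(path, index):
    """Yield one bulk action per CSV row (hypothetical helper)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, dialect="excel"):
            # Each action is a dict; "_source" holds the document itself.
            yield {"_index": index, "_source": dict(row)}


def index_csv(es, path, index="contract_search", chunk_size=1000):
    """Send all rows through helpers.bulk in batches of chunk_size."""
    # Imported here so csv_actions can be tried without the client installed.
    from elasticsearch import helpers

    # raise_on_error=False collects failures instead of raising BulkIndexError.
    ok, errors = helpers.bulk(
        es,
        csv_actions(path, index),
        chunk_size=chunk_size,
        raise_on_error=False,
    )
    return ok, errors
```

Because csv_actions is a generator, the whole file never has to sit in memory, and helpers.bulk splits the stream into requests of chunk_size documents. Looking at the returned errors list should show which rows were rejected and why.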

Forgive me, as I have little experience with Elasticsearch. However, is there a difference between a Python dictionary and the object you are suggesting I build by appending rows? Am I supposed to transform the data in some way? I don't fully understand what you are suggesting with the code provided.
Also, I looked at the link you provided, and I don't see anything particularly helpful there. The code works perfectly fine for small CSV files as it is. It does not work for large CSV files with thousands to millions of rows, and I don't understand why.
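
On the dictionary question: as far as I know, helpers.bulk expects an iterable of actions, so a single DictReader row passed directly gets iterated like any other dict, which yields its keys (the column names) rather than the row itself. The snippet below only demonstrates that Python behaviour with a made-up row. Whether this is what triggers the error in the original code is an assumption, since nothing in this thread confirms it.

```python
# A made-up row, shaped like what csv.DictReader yields (illustrative values).
row = {"contract_id": "A-1", "vendor": "Acme"}

# Iterating a single dict yields only its keys, i.e. the column names.
as_single_dict = list(row)

# Wrapping rows in a list yields whole dicts, one per document.
as_list_of_rows = list([row])
```

So a list of dicts (or a generator that yields dicts) is a different object from one dict: the first hands over whole rows, the second hands over bare strings.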
