parallel_bulk is a generator, meaning it is lazy and won't produce any results until you start consuming them. The proper way to use it is:
for success, info in parallel_bulk(...):
    if not success:
        print('A document failed:', info)
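For concreteness, here is a minimal self-contained sketch of that pattern; the client address, the index name 'my-index', the doc type and the sample documents are made up for illustration and not part of your example:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch()  # assumes a node running on localhost:9200

# a generator expression producing a few illustrative index actions
actions = (
    {'_op_type': 'index', '_index': 'my-index', '_type': 'doc', '_source': {'n': i}}
    for i in range(1000)
)

# nothing is sent until we iterate; the loop drives the bulk requests
for success, info in parallel_bulk(es, actions):
    if not success:
        print('A document failed:', info)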
If you don't care about the results (and by default you don't have to, since any error will cause an exception), you can use the consume recipe from the itertools recipes (https://docs.python.org/2/library/itertools.html#recipes), which exhausts the generator without keeping the results around:
from collections import deque
deque(parallel_bulk(...), maxlen=0)
Btw the reason it is lazy is so that you never have to materialize a list with all the records, which can be potentially very expensive (for example when inserting data from a DB, a long file, or doing a reindex). It also means that you don't have to pass in a list; you can pass in a generator, thus avoiding creating a huge in-memory list yourself. In your example you could have a generator function:
def generate_actions(data):
    # walk the DataFrame row by row and yield one index action per row
    for i in range(len(data)):
        source_dict = {}
        row = data.iloc[i]
        for k in header:
            source_dict[k] = str(row[k])
        yield {
            '_op_type': 'index',
            '_index': index_name,
            '_type': doc_type,
            '_source': source_dict
        }
and then call:
for success, info in parallel_bulk(es, generate_actions(data), ...):
    if not success:
        print('Doc failed', info)
which will avoid the need to have all the documents present in memory at any given time. When working with larger datasets this can be a significant memory saving!
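The same idea works when the source is a large file rather than a DataFrame. Here is a sketch under the assumption of a newline-delimited JSON file; the path 'data.ndjson' is a placeholder and index_name, doc_type and es are assumed to already exist as in your code:

import json

def generate_actions_from_file(path):
    # stream the file line by line instead of loading it all at once
    with open(path) as f:
        for line in f:
            yield {
                '_op_type': 'index',
                '_index': index_name,
                '_type': doc_type,
                '_source': json.loads(line)
            }

for success, info in parallel_bulk(es, generate_actions_from_file('data.ndjson')):
    if not success:
        print('Doc failed', info)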
I am following your recommendation to index a very large file, which takes hours. However, along the way the memory footprint increases until it starts swapping. Internally I am not saving anything, and a memory profile shows that it has something to do with parallel_bulk, but I cannot get deeper than that. Am I missing something here? Does this make any sense to you at all? I am using Python 2.7.5 with elasticsearch 2.3.0.