Pyes bulk insert problem

Hi,

I have an index of 200M documents that I would like to reindex.
I wrote the following script, which iterates over the documents in the old
index and writes them to the new index using bulk insert.
Each bulk contains 2000 documents.

    search_obj = pyes.query.Search(query=pyes.query.MatchAllQuery(),
                                   start=resume_from)

    old_index_iterator = self.esconn.search(search_obj, self.index_name)
    counter = 0
    BULK_SIZE = 2000

    for doc in old_index_iterator:
        self.esconn.index(doc=doc, doc_type=DOC_TYPE, index=new_index_name,
                          id=doc.get_id(), bulk=True)
        counter += 1

        if counter % BULK_SIZE == 0:
            self.logger.debug("Refreshing...")
            self.esconn.refresh()
            self.logger.debug("Refresh done.")

    self.esconn.refresh()

Observations:

  1. Throughput is very low: around 150 documents per minute.
  2. The refresh operation takes effectively no time.
  3. If I remove the index call (and only read the documents), the loop
    runs 10 times faster.
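
To put these rates in perspective, here is a back-of-the-envelope estimate. It assumes the observed rates stay constant for the whole run, which is an assumption and not something measured above.

```python
# Rough estimate of total reindex time at the rates reported above.
TOTAL_DOCS = 200 * 10**6           # 200M documents
INDEX_RATE = 150.0                 # docs per minute with the index call
READ_ONLY_RATE = INDEX_RATE * 10   # observation 3: reading alone is 10x faster

MINUTES_PER_DAY = 60 * 24


def days_needed(total_docs, docs_per_minute):
    """Return how many days a run takes at a constant rate."""
    return total_docs / docs_per_minute / MINUTES_PER_DAY


print("with indexing: %.0f days" % days_needed(TOTAL_DOCS, INDEX_RATE))
print("read only:     %.0f days" % days_needed(TOTAL_DOCS, READ_ONLY_RATE))
```

At a constant 150 documents per minute the run would take about 926 days. Reading alone, at ten times that rate, would still take about 93 days, so the read side may deserve a look as well.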

Conclusion:
The index call ignores the bulk=True flag and pushes every single document
to the ES server individually.

Can anyone please help me figure out why bulk=True has no effect?
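
For reference, this is the behaviour one would expect from a working bulk mode, sketched without pyes. `CountingTransport`, `BulkBuffer` and `count_requests` are hypothetical names made up for this illustration and are not part of any library. The sketch only shows how a client-side buffer turns many index calls into a few requests, and that a final flush is needed for the last partial batch.

```python
class CountingTransport(object):
    """Hypothetical stand-in for the HTTP layer; it only counts requests."""

    def __init__(self):
        self.requests = 0
        self.docs_sent = 0

    def send(self, docs):
        self.requests += 1
        self.docs_sent += len(docs)


class BulkBuffer(object):
    """Hypothetical client-side buffer that sends documents in batches."""

    def __init__(self, transport, bulk_size):
        self.transport = transport
        self.bulk_size = bulk_size
        self.pending = []

    def index(self, doc, bulk=True):
        if not bulk:
            # Unbuffered path: one request per document.
            self.transport.send([doc])
            return
        self.pending.append(doc)
        if len(self.pending) >= self.bulk_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, including a final partial batch.
        if self.pending:
            self.transport.send(self.pending)
            self.pending = []


def count_requests(n_docs, bulk, bulk_size=2000):
    """Index n_docs dummy documents and report (requests, docs_sent)."""
    transport = CountingTransport()
    buf = BulkBuffer(transport, bulk_size)
    for i in range(n_docs):
        buf.index({"n": i}, bulk=bulk)
    buf.flush()
    return transport.requests, transport.docs_sent
```

With 5000 documents and a bulk size of 2000, `count_requests(5000, bulk=True)` makes 3 requests, while `count_requests(5000, bulk=False)` makes 5000. Counting the requests that actually reach the server in the real script would show which of the two cases applies.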

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

First, you need to pass bulk_size when creating the connection.
A refresh after every batch is not required; do it once at the end of the indexing process.
Before bulk indexing, set refresh_interval to -1 in the index settings, and restore it at the end.
Bulk indexing should then work.
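
The refresh_interval advice can be sketched as a small helper that disables refresh before the bulk run and restores it afterwards, even if the run fails. This is a sketch under assumptions: `update_settings` is a hypothetical callable taking an index name and a settings body (it stands in for whatever the client exposes), and the restore value of "1s" assumes the index used the Elasticsearch default.

```python
from contextlib import contextmanager


@contextmanager
def refresh_paused(update_settings, index_name, restore_to="1s"):
    """Disable index refresh for the duration of the block, then restore it.

    update_settings is a hypothetical callable: (index_name, settings_body).
    restore_to assumes the index previously used the default interval.
    """
    update_settings(index_name, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Runs even if bulk indexing raises, so refresh is never left off.
        update_settings(index_name, {"index": {"refresh_interval": restore_to}})


# Usage with a recording stand-in in place of a real client call.
calls = []


def record_settings(index_name, body):
    calls.append((index_name, body))


with refresh_paused(record_settings, "new_index"):
    pass  # bulk indexing would go here
```

After the block, `calls` holds two entries: the first sets the interval to "-1" and the second restores it to "1s".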

Sent from iPhone

On 11 Jun 2013, at 08:28, Dmitry Babitsky dimok21@gmail.com wrote:


Hi Alberto,

I've implemented your suggestions (bulk_size at connection creation, plus
update_settings to set refresh_interval to -1 on the new index), but
unfortunately it did not help.
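
One way to separate a client problem from a server problem is to build a bulk request body by hand and send it to the `_bulk` endpoint with any HTTP client, which takes pyes out of the picture. The sketch below only builds the body; sending it is left out. `build_bulk_body` is a hypothetical helper, and the action fields (`_index`, `_type`, `_id`) follow the Elasticsearch bulk format as I understand it for this generation of the server, so treat them as an assumption to check against the docs for your version.

```python
import json


def build_bulk_body(docs, index_name, doc_type):
    """Build a newline-delimited bulk body from (doc_id, source) pairs.

    Each document becomes two lines: an action line and the source line.
    The body must end with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index_name,
                            "_type": doc_type,
                            "_id": doc_id}}
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(source, sort_keys=True))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Example: two dummy documents for a hypothetical index and type.
body = build_bulk_body(
    [("1", {"title": "first"}), ("2", {"title": "second"})],
    index_name="new_index",
    doc_type="doc",
)
```

If a hand-built batch of 2000 documents is indexed quickly, the server is fine and the bottleneck is on the client side.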

On Tuesday, June 11, 2013 8:13:05 PM UTC+3, Alberto Paro wrote:

First, you need to pass bulk_size when creating the connection.
A refresh after every batch is not required; do it once at the end of the indexing process.
Before bulk indexing, set refresh_interval to -1 in the index settings, and restore it at the end.
Bulk indexing should then work.
