[SOLVED] Elasticsearch Python API and reindex module


(alexandre) #1

Hi all,

I have written a Python script that reindexes a list of indices.

My problem is with the timeout and wait_for_completion options.

When my script launches a reindex, I have to wait until it has finished, otherwise the reindex is not performed completely. To do that, I need to set the global request_timeout option to a suitable value based on the index size, and wait for the reindex of each index to end.

If I set wait_for_completion, an exception is raised and the reindex fails (even for an index holding only a few kilobytes). If I use the timeout option of the reindex call (with 5 minutes, for example), it fails too (I also end up with a new index that is missing some of my documents).

So the only way that works for me is to use the global request_timeout parameter with a value derived from the index size. But if the index is big, that can take a while.

In my environment, 10 indices took 30 minutes. Next time I need to reindex almost 200 indices, so that is too long.

Does anybody have an idea how to run this kind of script in the background, or something like that?

Thanks in advance,
Alex


(Nik Everett) #2

This is a thing we fixed in 5.0, which isn't released yet. You use
wait_for_completion=false and it gives you back a job. You can then HTTP
GET the job with wait_for_completion=true. If that times out you can just
try again. The fix in 5.0 is that even if the job finishes when you aren't
waiting it'll still return from that GET API.
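
A minimal sketch of that flow in Python, under some assumptions: `es` is an elasticsearch-py client talking to Elasticsearch 5.0 or later, `reindex` accepts `body` and `wait_for_completion` (the exact call shape depends on the client version), and the task GET response carries a `completed` flag. The function name `reindex_and_wait` and the `poll_seconds` parameter are made up for illustration. It uses plain polling rather than the blocking GET with wait_for_completion=true described above.

```python
import time


def reindex_and_wait(es, source, dest, poll_seconds=5.0, sleep=time.sleep):
    """Start a reindex as a background task and poll until it completes.

    `es` is assumed to be an elasticsearch-py client; the keyword arguments
    used here may differ between client versions.
    """
    # With wait_for_completion=False the call returns immediately with a
    # task id instead of blocking until every document has been copied.
    started = es.reindex(
        body={"source": {"index": source}, "dest": {"index": dest}},
        wait_for_completion=False,
    )
    task_id = started["task"]

    # Ask the tasks API for the task status until it reports completion.
    while True:
        status = es.tasks.get(task_id=task_id)
        if status.get("completed"):
            return status
        sleep(poll_seconds)
```

Because no single HTTP request stays open for the whole copy, the script no longer needs a request_timeout scaled to the index size.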


(alexandre) #3

Thanks for your reply and sorry for my late answer.

It seems that if I use "wait_for_completion=false" for a reindex, I can't start another one: starting a new reindex stops the first. Is that right, or did I do something wrong?

Thanks in advance.


(Nik Everett) #4

Elasticsearch doesn't mind if you have multiple reindex tasks running at one time. If they both try to write to the same place then you are going to have trouble, but that is what the _cancel API is for. Canceling looks like:

curl -XPOST _tasks/{task_id}/_cancel
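
For a Python script, a hedged sketch of the same idea: start several reindex tasks in the background, refuse to start two that write to the same destination, and cancel one by task id. It assumes `es` is an elasticsearch-py client whose `reindex` accepts `body` and `wait_for_completion` and whose `tasks.cancel` takes a `task_id`; the helper names `start_reindex_tasks` and `cancel_reindex` are hypothetical.

```python
def start_reindex_tasks(es, pairs):
    """Start one background reindex per (source, dest) pair.

    Returns a dict mapping each destination index to its task id.
    `es` is assumed to be an elasticsearch-py client.
    """
    dests = [dest for _, dest in pairs]
    # Two tasks writing to the same index would interfere with each other.
    if len(set(dests)) != len(dests):
        raise ValueError("each reindex must write to its own destination index")

    task_ids = {}
    for source, dest in pairs:
        started = es.reindex(
            body={"source": {"index": source}, "dest": {"index": dest}},
            wait_for_completion=False,
        )
        task_ids[dest] = started["task"]
    return task_ids


def cancel_reindex(es, task_id):
    """Client-side counterpart of POST _tasks/{task_id}/_cancel."""
    return es.tasks.cancel(task_id=task_id)
```

The returned task ids can be kept by the script and checked later through the tasks API.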

(alexandre) #5

Thanks, but I'm not sure that we are talking about the same thing.

When I reindex one index (300 MB) with the Python API and the next action in my script is to reindex another index, the first reindex stops (its new index ends up at 3 MB, for example, instead of 300 MB). The second reindex task finishes correctly (if I have just 2 reindex operations).

If I have 50 reindex tasks in my script, the first 49 don't complete correctly, but the 50th does.

I'm not sure I have explained my issue clearly, so sorry about that.

Thanks,
Alex


(Joar Svensson) #6

Have you considered using the Curator? https://github.com/elastic/curator


(alexandre) #7

Hi, thanks.

I'm not sure that Curator supports reindexing.
I will check the code anyway.

Alex


(Aaron Mildenstein) #8

Curator is going to have reindex in 4.x (there's already a feature request for it, and I'm actively developing it), but more especially in 5.x, using the Reindex API. It will not do generic reindexing for versions of Elasticsearch without the Reindex API, which was added in Elasticsearch 2.3.


(alexandre) #9

Thanks !


(Nik Everett) #10

That is pretty clear. This sounds like an issue with the python API.


(alexandre) #11

Ok, thanks.

