Count of docs after _reindex'ing higher than before

Nope. Nothing is found. Houdini document :stuck_out_tongue:

Then, the only way I see this could happen is if some process somewhere indexes a new document in the new index. Is that possible?

We have a process that runs every evening but it updates data for the last month only. It definitely wouldn't be looking at documents that are years old.

You're not using an alias as the source index name, but are you using a wildcard in the name by any chance (it wasn't clear from your example above)?

Nope. All indices follow the pattern index_name-year.month but they are explicitly provided, e.g. orders_prod-2018.01, without any wildcards.

And of course, dumb question, but is your destination index completely empty (or even nonexistent) before you kick off the reindex call?

The index is created just before making the call, so yes - before the call it is completely empty.

Do you mind sharing the response you get from the reindex call?

I often (so far always when reindexing orders) get

{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}
I never found an explanation for this, but the docs still got indexed, so I totally forgot about that little fact.

I'm trying to get an example of a response that doesn't time out...

Note that you can always get the status of an ongoing reindex task even if the browser has timed out by running this:

GET _tasks?detailed=true&actions=*reindex

I would also add wait_for_completion=false to your reindex call, so that you can query the task status even after the reindex task has finished. If you do that, make sure to delete the task document manually once you've finished inspecting it, e.g.

DELETE .tasks/task/r1A2WoRbTwKZ516z6NEs5A:36619
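
For anyone scripting this instead of going through the console, here is a minimal Python sketch of starting a reindex with wait_for_completion=false. The cluster URL (http://localhost:9200) and the function names are assumptions for illustration, and the request body is reduced to source and dest only; add your own type filter or script as needed.

```python
import json
import urllib.request

# Assumption: a local cluster URL; replace with your own endpoint.
ES_URL = "http://localhost:9200"


def build_reindex_request(source_index, dest_index, base_url=ES_URL):
    """Build (but do not send) a POST _reindex request that runs as a background task."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    return urllib.request.Request(
        url=f"{base_url}/_reindex?wait_for_completion=false",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_reindex(source_index, dest_index, base_url=ES_URL):
    """Send the request and return the task ID from the {"task": "..."} response."""
    request = build_reindex_request(source_index, dest_index, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task"]
```

Because the call returns as soon as the task is registered, a proxy or browser timeout such as the 504 above should no longer hide the final result.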

Hmmm... While the task was running I kept an eye on its output and there was nothing indicating any issues. After the task completed I couldn't retrieve its details, though. I'm getting

{
  "nodes": {}
}

(These _tasks API calls are new to me)

But... the new index now has the same number of documents as the original one. And the index I used was the one that I had earlier established had additional docs; I cleared it out before reindexing. I searched for that additional doc, but it's not there.

It makes me worried that there might indeed be something changing/removing documents but there is nothing from our side that should be doing it.

See the second part of my previous comment on how to retrieve the task status after the task has finished running.

I'm pretty confident that it is not the reindex process causing the issue you're seeing, but some rogue process somewhere. We'll find out.

I used wait_for_completion=false as suggested. Unless it was supposed to be set to true instead?

Did you run this?

POST _reindex?wait_for_completion=false

If yes, then in the response you get the task ID that you can then use to query its status like this:

GET _tasks/<taskID>

or like this:

GET .tasks/task/<taskID>

The second query will give you a document containing the task status even after the task has finished. You're supposed to delete it when you don't need it anymore.
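
If you'd rather not re-run the GET by hand, the polling can be sketched in Python roughly like this. It assumes the GET _tasks/<taskID> response carries a top-level "completed" flag (as in the output further down this thread); the cluster URL, function names, and polling limits are made up for the example.

```python
import json
import time
import urllib.request

# Assumption: a local cluster URL; replace with your own endpoint.
ES_URL = "http://localhost:9200"


def http_get_json(path, base_url=ES_URL):
    """GET a path on the cluster and decode the JSON response."""
    with urllib.request.urlopen(f"{base_url}/{path}") as response:
        return json.load(response)


def wait_for_task(task_id, fetch=http_get_json, interval_seconds=5.0, max_polls=120):
    """Poll GET _tasks/<task_id> until "completed" is true, then return the document."""
    for _ in range(max_polls):
        doc = fetch(f"_tasks/{task_id}")
        if doc.get("completed"):
            return doc
        time.sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```

The fetch parameter is only there so the loop can be exercised without a live cluster; in normal use you would call wait_for_task("<taskID>") with the ID returned by the reindex call.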

Aaah, I see. :slight_smile: All info available now.

{
  "completed": true,
  "task": {
    "node": "dummy node",
    "id": 72607957,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 646865,
      "updated": 646865,
      "created": 0,
      "deleted": 0,
      "batches": 647,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0
    },
    "description": "reindex from [prod-2015.04][order] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.remove('account_is_subscriber');ctx._source.remove('item_count_stock_tracked');', options={}, params={}} to [orders_prod-2015.04]",
    "start_time_in_millis": 1516285516650,
    "running_time_in_nanos": 224687658206,
    "cancellable": true
  },
  "response": {
    "took": 224687,
    "timed_out": false,
    "total": 646865,
    "updated": 646865,
    "created": 0,
    "deleted": 0,
    "batches": 647,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until_millis": 0,
    "failures": []
  }
}

Ok great job! Now, does 646865 correspond to the number of documents you have in the source index prod-2015.04 (only the order type of course)?

Yes, it does.
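
That comparison can also be scripted. Below is a small Python sketch that takes the task document shown above (already parsed into a dict) and the document count you expect from the source index. The function name and the shape of the returned summary are invented for this example; only the field names come from the response itself.

```python
def check_reindex(task_doc, expected_source_count):
    """Summarize a finished reindex task document against the expected source count."""
    response = task_doc["response"]
    # Documents written to the destination, whether created new or updated in place.
    written = response["created"] + response["updated"]
    return {
        "total": response["total"],
        "created": response["created"],
        "updated": response["updated"],
        "written": written,
        "failures": len(response.get("failures", [])),
        # True only if every expected source document was read and written.
        "matches_source": (
            response["total"] == expected_source_count
            and written == expected_source_count
        ),
    }
```

The summary keeps created and updated separate, since the response reports them separately, and flags a mismatch if either the total read or the total written differs from the expected count.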

There we go, we've been looking at the wrong place all this time :slight_smile:
The reindex process doesn't seem to be the culprit, I'm afraid.

Well, this makes it an even greater mystery, as other than myself and my manager (who was away while I was reindexing data) there is no one else who works with Elasticsearch on the project I am in. :joy:

Thank you for your time. I really appreciate it. :slight_smile: