Count of docs after _reindex'ing higher than before

Hello,

I've recently had to reindex all docs due to multi-type mappings being deprecated.

There is one observation that I can't explain...
After reindexing (via the _reindex API) I found that the new index has more documents than the original one. How is that possible? I narrowed down which documents were new and tried to search for them in the old index, but with no luck...

How did you get the count of documents?

I used Kibana's Discover page with an absolute time range.

What do you get when running this in Dev Tools?

GET _cat/indices?v

Hmmm... I think comparing document counts using the above won't work because the old index has a multi-type mapping.

Ok, then what about the following?

curl -XGET http://xx.xx.xx:9200/source-index/source-type/_count?pretty
curl -XGET http://xx.xx.xx:9200/dest-index/_count?pretty

Note that the count we'll see does not include nested documents. Do you have nested types in your mapping?

No nested types in our mapping. So example result is:

Old index: 1573300
New index: 1573367

Since the difference is quite small, it would be interesting to proceed by dichotomy, using a date_histogram aggregation on both indices to see in which buckets the differences are. Can you run this aggregation on both of your indices and see in which month (you might use year or day as well) the differences appear? Then we can drill down further until we find the culprit.

{
  "size": 0,
  "aggs": {
    "dichotomy": {
      "date_histogram": {
        "field": "your_date_field",
        "interval": "month"
      }
    }
  }
}
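Once the aggregation has been run on both indices, the two responses can be compared bucket by bucket. Here is a minimal Python sketch of that comparison. It assumes the usual date_histogram response shape (`aggregations.dichotomy.buckets`, each bucket with `key_as_string` and `doc_count`); the helper names and the sample numbers are made up for illustration and are not from your cluster.

```python
def bucket_counts(response, agg_name="dichotomy"):
    """Map each bucket's key_as_string to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key_as_string"]: b["doc_count"] for b in buckets}


def differing_buckets(source_resp, dest_resp, agg_name="dichotomy"):
    """Return {bucket: (source_count, dest_count)} for buckets whose counts differ."""
    src = bucket_counts(source_resp, agg_name)
    dst = bucket_counts(dest_resp, agg_name)
    diffs = {}
    # Walk the union of keys so a bucket missing on one side still shows up.
    for key in sorted(set(src) | set(dst)):
        s, d = src.get(key, 0), dst.get(key, 0)
        if s != d:
            diffs[key] = (s, d)
    return diffs


# Made-up sample responses, trimmed to the fields used above.
source_resp = {"aggregations": {"dichotomy": {"buckets": [
    {"key_as_string": "2015-01-01T00:00:00.000Z", "doc_count": 100},
    {"key_as_string": "2015-02-01T00:00:00.000Z", "doc_count": 120},
]}}}
dest_resp = {"aggregations": {"dichotomy": {"buckets": [
    {"key_as_string": "2015-01-01T00:00:00.000Z", "doc_count": 100},
    {"key_as_string": "2015-02-01T00:00:00.000Z", "doc_count": 123},
]}}}

for bucket, (s, d) in differing_buckets(source_resp, dest_resp).items():
    print(bucket, "source:", s, "dest:", d)
```

Only the buckets that differ are reported, which gives the time range to narrow down next.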

Our indices are monthly already. I did a visualisation which split data on _index term and nearly every index has got some discrepancies.

I also used your suggested script and split yearly data (so e.g. "index-2015*") by monthly interval and nearly every month has got higher count in the new index.

Ok, then let's take one month and drill down, by day, hour, minute... until we find one doc that is in the destination index but not in the source index.
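To make each drill-down step concrete, here is a rough Python sketch that builds the request body for one step: it restricts the search to the suspicious time window with a range query and re-runs the date_histogram at a finer interval. The function name and the example dates are hypothetical, `your_date_field` is the same placeholder as in the aggregation above, and it keeps the `interval` parameter used there.

```python
import json


def drill_down_body(field, start, end, interval):
    """Build a search body that buckets only the [start, end) window at a finer interval."""
    return {
        "size": 0,
        # Restrict the search to the window where the counts differ.
        "query": {"range": {field: {"gte": start, "lt": end}}},
        "aggs": {
            "dichotomy": {
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


# Example: the month that differs, now split by day (dates are made up).
body = drill_down_body("your_date_field", "2015-02-01", "2015-03-01", "day")
print(json.dumps(body, indent=2))
```

Run the printed body against both indices, pick the day that differs, and repeat with "hour" and then "minute" on the narrower window.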

Found an example of a doc that is not listed in the source index but is in the target index.

Good! And what does it tell us? Do you have any idea? Do you want to share it?

I won't be able to share it as it has information about a customer and a specific order that was placed. What I tried to do was to pick some potentially unique fields, e.g. order value, channel id it came from, and do a search globally, hoping that something would be found but with no success.

How do you call the reindex API? Do you mind sharing your command? Are you using an alias as the source index?

Sure, that I can do.

POST _reindex
{
  "source": {
    "index": "source_index",
    "type": "order"
  },
  "dest": {
    "index": "target_index",
    "pipeline": "rename_order_fields"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.remove('account_is_subscriber');ctx._source.remove('item_count_stock_tracked');"
  }
}

Here the rename_order_fields pipeline removes the order_ prefix from some fields whose names changed over time, e.g.:

PUT _ingest/pipeline/rename_order_fields
{
  "description" : "rename order fields",
  "processors" : [
    {
      "rename": {
        "field": "order_value_USD",
        "target_field": "value_USD"
      }
    }
    etc ...
  ]
}

Also, not using aliases.

Is it possible that the document has been deleted in the source index in the meantime (or during the reindex)? Do you have a process that deletes (old/rotten) documents based on some condition?

Nope, we don't delete any documents.

What does the diff between the document in the new index and the old index tell us?

Sorry, I'm not sure what you mean by diff. That document doesn't seem to exist in the old index, or at least I can't get it to surface there, so other than the fact that it exists in one index but not in the other, there is nothing else I can compare.

Yes, sorry. Ok, then the ID is not supposed to change between the source and the target index. Is there any way to find that document by ID in any other index?
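One way to do that lookup is an `ids` query sent to `_search` without an index name, so it runs across all indices. The Python sketch below only builds the request body; the function name and the ID value are placeholders, and `"_source": false` is there so the customer data is not returned, only the metadata of each hit.

```python
import json


def ids_query_body(doc_ids):
    """Build a search body that matches documents by _id only."""
    return {
        "query": {"ids": {"values": list(doc_ids)}},
        # Leave the document body out of the response; _index and _id are enough here.
        "_source": False,
    }


# "hypothetical-order-id" stands in for the _id of the extra document.
body = ids_query_body(["hypothetical-order-id"])
print(json.dumps(body, indent=2))
```

Each hit's `_index` then shows which index holds a document with that ID.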