Missing documents when using re-index API

Hello,
I'm facing an issue while using re-index API.
Currently we have 1 index per day of data, which leaves us with just 100 days of data visible on the dashboard (because of the data view limitation), so I was trying to re-index my data into one index per month.

No matter how many times I repeat the process, I always end up with missing documents. I tried to re-index only 10 days and still got missing data (5,568,085 documents before vs 5,567,293 documents after reindexing).

I'm using Kibana Dev Tools, so I don't know how to check for errors that may have happened during indexing. Also, I did not set mappings for the new index; it is automatically created by Elasticsearch.

ELK stack version ==> 8.13.2 | OS ==> Windows

Command ==>

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "bkgroupindex-2024.01.1*"
  },
  "dest": {
    "index": "bkgroupindex_jan2-2024.01"
  }
}

Is it the same mapping?
Are you checking the output of the reindex operation?

Hello David,
Yes, I have checked the mappings for both and they are identical.
I checked the output using
GET _tasks/<task_id>
and this is what I got:

{
  "completed": true,
  "task": {
    "node": "-ZXsTSb0S96MbhLlrdEw4g",
    "id": 35680275,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 5568085,
      "updated": 792,
      "created": 5567293,
      "deleted": 0,
      "batches": 5569,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0
    },
    "description": "reindex from [bkgroupindex-2024.01.1*] to [bkgroupindex_jan2-2024.01]",
    "start_time_in_millis": 1715072504489,
    "running_time_in_nanos": 390270396700,
    "cancellable": true,
    "cancelled": false,
    "headers": {
      "trace.id": "4598c3e599c4b5965f126804c4cdc06d"
    }
  },
  "response": {
    "took": 390269,
    "timed_out": false,
    "total": 5568085,
    "updated": 792,
    "created": 5567293,
    "deleted": 0,
    "batches": 5569,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled": "0s",
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until": "0s",
    "throttled_until_millis": 0,
    "failures": []
  }
}

Since there are 0 failures in the operation, could this be related to the document ID / unique field in Logstash? Maybe Elasticsearch is overwriting some records.

NOTE: not all days have missing data; some daily indices were copied correctly into the new one.

I'm curious about:

"updated": 792,

You are not writing to a new index but to an existing index?


No, it's a new index.
Could it be overwriting records?

Did you let Elasticsearch assign document IDs for documents in these indices or have you specified them elsewhere?

It is configured in logstash ==> document_id => "%{seqid}-%{node}"

We are analyzing calls in an IVR system; each call has a unique seqid, and we need to count hits on each node. Since we have a pipeline for each node and a call passes through multiple nodes, we end up with multiple records with the same seqid, so we had to make a unique combination for each record.

Though it works properly in Logstash (old indices) and the records are identical to those in the database, I don't know why data is overwritten during reindexing.

You are merging multiple indices into one, so if you had documents with the same ID going to different indices under the old naming convention, these would result in updates and show up as deleted documents when merged into a single index.

I suspect that is the reason you are seeing what you are seeing.
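As a minimal illustration (the test- index names and field values here are made up, not from your data), you can reproduce the effect in Dev Tools: index the same ID into two daily-style indices, then reindex both into one target.

```
PUT test-2024.01.10/_doc/123-menu
{ "seqid": "123", "node": "menu" }

PUT test-2024.01.11/_doc/123-menu
{ "seqid": "123", "node": "menu" }

POST _reindex
{
  "source": { "index": "test-2024.01.1*" },
  "dest": { "index": "test-merged" }
}
```

test-merged ends up with one document instead of two, and the reindex response reports "created": 1, "updated": 1. Note that in your own task output, created (5,567,293) + updated (792) = total (5,568,085), so no documents were lost outright; 792 were overwritten by a later copy with the same ID.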

You mean that I already have duplicates in the data, right?
But I thought specifying document_id in the configuration would make Elasticsearch overwrite identical documents if they exist instead of creating new ones.

Is there a way to know the IDs for the updated documents?

Yes.

It does, but only if the documents go to the same index. If you happen to be writing documents with the same ID to different indices (that you are now merging) you would experience updates and see deleted documents in the index stats.
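One way to surface the duplicate IDs (a sketch using the reindex API's conflicts and op_type options; dup-check is a hypothetical scratch index name): rerun the reindex into an empty scratch index with "op_type": "create", so a second document with an already-seen ID fails with a version conflict instead of silently overwriting the first.

```
POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": "bkgroupindex-2024.01.1*"
  },
  "dest": {
    "index": "dup-check",
    "op_type": "create"
  }
}
```

With "conflicts": "proceed", the task result counts the duplicate occurrences in version_conflicts; if you omit it, the operation aborts at the first conflict and the failures array lists the conflicting document ID, which tells you which seqid-node combinations collide across days.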

Thanks a lot.