Documents being deleted after BulkRequest indexing

vivss · May 17, 2023, 2:59pm

Hi all,

We are using Elasticsearch 7.17.7 and indexing documents via BulkRequest in Java API Client, and we noticed that many documents are being deleted after indexing. We retrieve the records from a Postgresql database and use their primary key, which is a UUID, as the elastic document ID, so we expect the number of records in the database and the final number of indexed documents to be identical. However, after sending 112,938 documents to elastic, 42,662 were deleted and only 70,276 remain in the index (please see the stats file below). Duplicate IDs shouldn't be an issue, since when we query the Postgresql database we in fact get 112,938 records, hence there indeed are this many distinct UUIDs.

What other criteria would elastic be using to delete documents? We tried indexing different databases, but the same happens for all of them. Indexing the same database multiple times also results in different numbers of deleted documents. We are at a loss about why this is happening, so any insight is welcome.

{
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "stats": {
    "uuid": "__l4nQu3TgaegNNbX7S3Rg",
    "primaries": {
      "docs": {
        "count": 70276,
        "deleted": 42662
      },
      "shard_stats": {
        "total_count": 1
      },
      "store": {
        "size_in_bytes": 35565294,
        "total_data_set_size_in_bytes": 35565294,
        "reserved_in_bytes": 0
      },
      "indexing": {
        "index_total": 112938,
        "index_time_in_millis": 4884,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0
      },
      ...
    }
  }
}

Christian_Dahlqvist · May 17, 2023, 3:13pm

The sum of count and deleted is exactly the number of rows sent in. A non-zero deleted count indicates updates, that is, documents that were overwritten by a later document with the same ID. I would recommend looking at a few documents in Elasticsearch and verifying that the ID is what you expect. Is the UUID field the only field that is part of the primary key? Does the query you run include any join that could cause a UUID to appear more than once? What do you get if you do a SELECT COUNT DISTINCT on the UUID column?
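
To illustrate why count + deleted adds up to index_total (70,276 + 42,662 = 112,938 in the stats above), here is a minimal sketch that applies index-by-ID semantics to a list of IDs using an in-memory map. The class and method names are hypothetical, and the map only stands in for the index; it is not how Elasticsearch stores documents.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverwriteCounts {

    /** Returns {live, overwritten} after "indexing" every ID in order. */
    static int[] simulateIndex(List<String> ids) {
        Map<String, Boolean> live = new HashMap<>();
        int overwritten = 0;
        for (String id : ids) {
            // put() returns the previous value when the ID already exists.
            // That case corresponds to an update: the old copy is marked
            // deleted and the new copy takes its place.
            if (live.put(id, Boolean.TRUE) != null) {
                overwritten++;
            }
        }
        return new int[] { live.size(), overwritten };
    }

    public static void main(String[] args) {
        List<String> ids = List.of("a", "b", "a", "c", "b", "a");
        int[] result = simulateIndex(ids);
        System.out.println("count=" + result[0]
                + " deleted=" + result[1]
                + " index_total=" + ids.size());
    }
}
```

Feeding it the IDs exactly as the indexing code emits them (rather than as the database returns them) would show whether repeats are introduced on the client side.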

vivss · May 17, 2023, 6:13pm

Hi Christian,

Thanks for your reply. Answering your questions:

Yes, the IDs of the documents in Elastic are exactly what I expect, that is, the same UUIDs coming from the database. Please see the picture below, with a few lines from the _id column in Kibana, and from the UUID column in the database, side by side

[Image: elastic_vs_dbtable]
Yes, the UUID field is the only part of the primary key
No, the query does not include any join
A SELECT COUNT DISTINCT on the UUID column returns 112,938

We've checked all this before, that's why we're clueless about the deletions.

Christian_Dahlqvist · May 17, 2023, 6:24pm

Elasticsearch does not automatically delete anything, which is why I suspected duplicates.

The deleted document count can change when merging occurs in the background, so may not always show the full number of deletes.

To troubleshoot this further I would do two things:

Ensure you use a create action rather than an index action in your bulk requests. If for some reason there is something that Elasticsearch considers a duplicate, this will be indicated in the response, so make sure to check it for errors and log whatever you find.
Try to identify some documents that are missing from Elasticsearch and see if there is any pattern in the records that are in the index compared to the ones that are missing.
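
For the second step, a minimal sketch of comparing the two ID sets. It assumes both lists have already been exported (for example from the database query and from a scan of the index); how they are exported is left out, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MissingIds {

    /** IDs present in the database export but absent from the index export. */
    static Set<String> missing(List<String> dbIds, List<String> indexedIds) {
        Set<String> result = new TreeSet<>(dbIds);
        // Wrap in a HashSet so the removal stays fast for large exports.
        result.removeAll(new HashSet<>(indexedIds));
        return result;
    }

    /** IDs present in the index export that the database does not know. */
    static Set<String> unexpected(List<String> dbIds, List<String> indexedIds) {
        Set<String> result = new TreeSet<>(indexedIds);
        result.removeAll(new HashSet<>(dbIds));
        return result;
    }

    public static void main(String[] args) {
        List<String> dbIds = List.of("id-1", "id-2", "id-3", "id-4");
        List<String> indexedIds = List.of("id-1", "id-3");
        System.out.println("missing from index: " + missing(dbIds, indexedIds));
        System.out.println("not in database:    " + unexpected(dbIds, indexedIds));
    }
}
```

The sorted output makes it easier to spot patterns such as contiguous ranges. A non-empty "not in database" set would suggest that IDs are being altered somewhere between the query and the bulk request.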

vivss · May 24, 2023, 7:43pm

Hi,

Took some time to analyze the code and the records, but reached no conclusion...

You say

and the response says nothing but that the request was successful: no errors and no further information. Further evidence that there are no errors is that all records are sent and received, as confirmed by the "index_total" in the stats file.
Also, what do you mean by "If there for some reason are something that is considered a duplicate", if, as I said, I can ensure there are no duplicate IDs? I go back to my first question: what other reasons, besides equal IDs, would result in duplicates?

I checked a number of records, both indexed and not indexed, and could identify no pattern at all.

leandrojmp · May 24, 2023, 11:43pm

I don't think there are other reasons; a duplicate in Elasticsearch happens when you try to index a document with an _id that already exists.

As Christian said, Elasticsearch will not delete any document by itself. If you have deleted documents, this means that your code tried to index documents with the same _id, which led Elasticsearch to update the previously indexed document, that is, to delete the old document and index the new one.

But since you have 112938 unique UUIDs, you should also get 112938 unique documents in Elasticsearch. The fact that you are getting deleted documents and a lower document count, and that indexing the same database multiple times results in different numbers of deleted documents, could mean that your code has an issue that makes it index different documents under the same _id.

Can you share your code or at least the parts related to the indexing?

Have you tried changing your code to index your documents without a custom _id, letting Elasticsearch set the document ID, to see how many documents you end up with?

vivss · May 25, 2023, 1:49pm

Hi Leandro,

The code is public on Github:

github.com

lareferencia/lareferencia-entity-lib/blob/main/src/main/java/org/lareferencia/core/entity/indexing/elastic/JSONElasticEntityIndexerImpl.java



It's part of a (much) larger framework and performs a kind of ETL procedure, so please let me know if you need further explanation on it.

DavidTurner · May 25, 2023, 1:56pm

the IDs of the documents in Elastic are exactly what I expect

Are you sure? It looks like 0002ead6-4606-4042-9ac6-0788ffc1864b9 and several others are missing.

vivss · May 25, 2023, 2:04pm

I mean their shape is as I expect it to be, that is, they are formatted as UUIDs, precisely as I intended them to be. Yes, there are a lot of missing IDs (and hence missing documents); this is exactly the problem I'm reporting.

DavidTurner · May 25, 2023, 2:17pm

I see, ok, if you only ever added documents with ID 0002ead6-4606-4042-9ac6-0788ffc1864b9 (even with duplicates) and at least one of those indexing requests succeeded, then Elasticsearch would have a document with that ID. So either it's not being indexed successfully with that ID, or something else is deleting it.

Christian_Dahlqvist · May 25, 2023, 2:18pm

Send bulk requests with the create action instead of index action. This will force an error instead of an update if there are duplicates and should help you troubleshoot as long as you analyse and log response errors.
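
As a rough illustration of what that looks like on the wire, here is a minimal sketch that builds a _bulk request body using create actions. It uses only the standard library; the class name, index name, and sample documents are hypothetical, and it assumes the index name and IDs need no JSON escaping (true for UUIDs) and that each document source is already a single-line JSON string.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkCreateBody {

    /**
     * Builds an NDJSON _bulk body with one "create" action per document.
     * Each entry in docs is {id, jsonSource}.
     */
    static String build(String index, List<String[]> docs) {
        StringBuilder body = new StringBuilder();
        for (String[] doc : docs) {
            // Action line: "create" fails if the _id already exists,
            // whereas "index" would silently overwrite it.
            body.append("{\"create\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(doc[0]).append("\"}}\n");
            // Source line: the document itself, followed by a newline.
            body.append(doc[1]).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"id-1", "{\"title\":\"first\"}"});
        docs.add(new String[] {"id-1", "{\"title\":\"second\"}"});
        System.out.print(build("my-index", docs));
    }
}
```

With this body the second id-1 item is rejected with a 409 version conflict instead of overwriting the first. Note that the bulk request as a whole still returns HTTP 200; the failure only shows up as "errors": true at the top level and in the individual item results, so each item has to be inspected. With the high-level REST client the equivalent is setting the op type to CREATE on each IndexRequest and checking BulkResponse.hasFailures() and the per-item failure messages.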

stephenb · May 25, 2023, 3:06pm

@vivss I notice you are using the OpenSearch clients. Perhaps that is an issue; I'm pretty sure you should be using our official clients, assuming your target is actually Elasticsearch.

import org.opensearch.action.bulk.BulkRequest;
import org.opensearch.action.bulk.BulkResponse;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestClient;
import org.opensearch.client.RestClientBuilder;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.client.indices.CreateIndexResponse;
import org.opensearch.client.indices.GetIndexRequest;
import org.opensearch.common.xcontent.XContentType;

system · May 25, 2023, 3:06pm

OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)

