Updating documents when using auto-generated IDs

Hi,

I have an index running on 6.3. It holds about 11 million documents, and because the documents are nested, the underlying document count is much larger. These documents represent information stored in multiple databases relating to a specific id (a long int). Since that id is unique, it is also provided as the _id for the document.

I am looking at ways to tune the indexing for this. One thing that caught my eye when reading the guide was the suggestion to use auto-generated IDs.

Here are my concerns.

  1. Would unique IDs such as the ones I provide be enough, or does the auto-generated ID provide something much better?
  2. If I were to use autogenerated IDs, how will I update the document using the Update API? I don't have the ID anymore (since that's autogenerated). And I wasn't intending to store the autogenerated IDs in any other database / datastore.
  3. Thinking aloud on the second point, should I be making a retrieval call on Elasticsearch just to check for the autogenerated ID and "upsert" accordingly?

Regards,
Jerry

When you say tuning the index, what exactly are you trying to achieve, or what problem do you have that you are looking to solve?

Updating large nested documents can be expensive irrespective of what type of ID you are using, as all nested documents need to be reindexed behind the scenes. Using auto-generated IDs can speed up indexing of immutable documents but does not help when updating.

By tuning the index, I mean I am looking for ways to:

  1. Index the documents faster.
  2. Consume fewer resources while indexing.

The above is what I really want to solve (all without changing the document structure, i.e. its size and deep nesting, at least for now).
So as part of that investigation, I assume everything listed in the Elasticsearch suggestions can be made applicable / tried out.

The reason for my post is this: if you use auto-generated IDs for indexing, how do you update a document to reflect updates in the database? (Elasticsearch holds data from the DB.)

If you update data, using auto-generated IDs does not make sense and will not bring any benefit.
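
For comparison, here is a rough sketch of what an update looks like when you keep your own database id as the _id. It only builds the request for the Update API with `doc_as_upsert`, so no lookup is needed beforehand. The index name `customers`, the `_doc` type, and the field values are made up for illustration, and the URL shape assumes the 6.x typed endpoint.

```python
import json

def build_upsert_request(index, doc_type, doc_id, fields):
    """Return (method, path, body) for an Update API call that inserts
    the document if it does not exist yet (doc_as_upsert)."""
    # 6.x style path: /{index}/{type}/{id}/_update
    path = "/{}/{}/{}/_update".format(index, doc_type, doc_id)
    body = {"doc": fields, "doc_as_upsert": True}
    return "POST", path, json.dumps(body, sort_keys=True)

# Hypothetical values: the database id doubles as the Elasticsearch _id.
method, path, body = build_upsert_request(
    "customers", "_doc", 1234567890123, {"status": "active"}
)
print(method, path)
print(body)
```

Because the database id is already known to the indexing process, one request covers both the "new" and the "changed" case.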

Thank you, that helps clarify my main question.

However, your response brings me to a couple of queries.

  1. "all nested documents need to be reindexed behind the scenes". - I would be happy to read more on this as I am interested to know how costly this operation is.
  2. Is it possible to use the Update API as an upsert operation when IDs are auto-generated? (This is me being curious.) The Update API requires you to pass in the ID of the document you wish to update. AFAIK, you would need to make a GET call to Elasticsearch and then, using the ID from the response, make the UPDATE call.

The more nested documents and levels, the more costly it is. It can slow down indexing significantly and is the main reason I do not think you will see much improvement from the standard tuning steps, which are largely aimed at immutable, non-nested documents.

The performance impact will be even greater if you update the same document frequently as this can result in a lot of very small segments.
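
To get a feel for the cost described above, here is a small sketch that estimates how many underlying Lucene documents a single nested document turns into. Each nested object is indexed as its own hidden document next to the root, and an update rewrites the whole group. The sketch assumes every list of objects is mapped as `nested`, which may not match your mapping, and the sample document is invented.

```python
def count_lucene_docs(doc):
    """Count the root document plus every nested object below it.

    Assumption: every list of dicts is mapped as `nested`.
    """
    total = 1  # the document (or nested object) itself
    for value in doc.values():
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    total += count_lucene_docs(item)
    return total

# Invented example: 1 root + 2 orders + 3 items
example = {
    "id": 42,
    "orders": [
        {"order_id": 1, "items": [{"sku": "a"}, {"sku": "b"}]},
        {"order_id": 2, "items": [{"sku": "c"}]},
    ],
}
print(count_lucene_docs(example))  # prints 6
```

Changing a single field on the root of this example would therefore rewrite six Lucene documents, not one.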

With auto-generated IDs you do indeed need to search before updating, which is why it does not make any sense for your use case.
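
As an illustration of the extra step, the sketch below shows the search-then-update flow that auto-generated IDs would force. It only builds request bodies and reads the `_id` out of a search response, so it runs without a cluster. The field name `source_id`, the index name, and the sample responses are hypothetical, and the paths assume the 6.x typed endpoints.

```python
def build_lookup_query(field, value):
    """Search body that finds the document carrying the database id."""
    return {"size": 1, "query": {"term": {field: value}}}

def extract_id(search_response):
    """Return the _id of the first hit, or None if nothing matched."""
    hits = search_response["hits"]["hits"]
    return hits[0]["_id"] if hits else None

def plan_write(index, doc_type, search_response, fields):
    """Decide between indexing a new document and updating an existing one."""
    es_id = extract_id(search_response)
    if es_id is None:
        # No match: index a new document and let Elasticsearch pick the _id.
        return "POST", "/{}/{}".format(index, doc_type), fields
    # Match: second round trip, this time to the Update API.
    path = "/{}/{}/{}/_update".format(index, doc_type, es_id)
    return "POST", path, {"doc": fields}

# Hypothetical responses from the lookup search.
found = {"hits": {"total": 1, "hits": [{"_id": "AWx3kQ", "_source": {"source_id": 7}}]}}
missing = {"hits": {"total": 0, "hits": []}}

print(build_lookup_query("source_id", 7))
print(plan_write("customers", "_doc", found, {"status": "active"}))
print(plan_write("customers", "_doc", missing, {"source_id": 7, "status": "active"}))
```

Every write now costs a search plus a write. A document indexed moments ago may also not be visible to the search until the next refresh, so the lookup can miss it and produce a duplicate.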
