Accessing _id in ingest pipeline

The Search After documentation (as well as one or two other places in the docs) states that aggregating and sorting on the _id field is inefficient, and that it is better to duplicate _id into another ordinary field with doc_values enabled. The docs also suggest using an ingest pipeline:

Instead it is advised to duplicate (client side or with a set ingest processor)

However, I haven't found a way to do that myself, nor any online example showing how. It looks like the pipeline is executed before the autogenerated _id is assigned. The following processor:

{
  "set": {
    "field": "tie_breaker_id",
    "value": "{{_id}}"
  }
}

assigns an empty string to the tie_breaker_id field.

Hey,

Indeed, the ID generation happens after the ingest pipeline is applied. I opened an issue at https://github.com/elastic/elasticsearch/issues/41163

--Alex

Thanks for the reply. What would you suggest for implementing the desired _id field duplication? Is client-side processing (2 requests) the only option?

May I ask for more information about the use case here? If it is logging, I wonder whether search after is needed; if it is something else, I am curious to learn more.

That said, client-side ID generation combined with an additional field would indeed work.
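
A minimal sketch of that client-side approach, in Python: generate the ID before indexing and copy it into an ordinary field, so a single request is enough. The helper name `build_index_action`, the index name `posts`, and the choice of a random UUID are assumptions for illustration only. The output is a pair of lines in the bulk API's newline-delimited JSON shape; no client library is called.

```python
import json
import uuid


def build_index_action(index, doc, doc_id=None):
    """Return the two NDJSON lines of a bulk "index" action.

    The ID is generated on the client (a random UUID here, which is an
    assumption; any unique string would do) and duplicated into the
    tie_breaker_id field so that sorting can use that field instead of _id.
    """
    doc_id = doc_id or uuid.uuid4().hex
    action = {"index": {"_index": index, "_id": doc_id}}
    # Build a copy so the caller's dict is not mutated.
    source = dict(doc, tie_breaker_id=doc_id)
    return json.dumps(action), json.dumps(source)


# Hypothetical usage: one document destined for an index called "posts".
action_line, source_line = build_index_action("posts", {"title": "Hello"})
print(action_line)
print(source_line)
```

Both lines carry the same value, once as `_id` and once as `tie_breaker_id`.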

It is not logging. It stores user-generated resources (blog posts, for example) with various attributes, to provide powerful and fast search using Elasticsearch. I am using search_after for pagination (with ["_score", "_id"] as the sort parameters) because, as I understand it, this is the optimal (if not the only) way to do traditional real-time user pagination.
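
For reference, a sketch of how such a paginated request body could look if the sort used the duplicated field rather than _id. It is written in Python and only builds the request dictionary. The function name `build_page_query` is hypothetical, and the sketch assumes tie_breaker_id is mapped as a keyword field holding a unique value per document.

```python
import json


def build_page_query(query, page_size, last_sort_values=None):
    """Build a search body sorted by score, with tie_breaker_id as tie-breaker.

    last_sort_values is the "sort" array of the last hit on the previous
    page; leave it as None to request the first page.
    """
    body = {
        "size": page_size,
        "query": query,
        "sort": [
            {"_score": "desc"},
            {"tie_breaker_id": "asc"},  # assumed keyword field, unique per document
        ],
    }
    if last_sort_values is not None:
        body["search_after"] = list(last_sort_values)
    return body


# Hypothetical usage: first page, then the page after a hit whose sort values were [1.25, "abc"].
query = {"match": {"title": "elasticsearch"}}
print(json.dumps(build_page_query(query, 10)))
print(json.dumps(build_page_query(query, 10, last_sort_values=[1.25, "abc"])))
```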

If it is a blog post, maybe using the URL (or its slug) as the ID would simplify things and also allow for stable lookups, since that is another part of the data that should be unique? That field can also be part of the document itself.

Of course, I also have an id field with the autogenerated ID from my primary RDBMS. But I have a collection where each "blog post" may have multiple documents, which are duplicates except for one field containing a geopoint (this structure is used for aggregations with filters to display points on a map), so any otherwise unique attributes of a "blog post" are not unique here. I guess it will therefore be easier to generate the Elasticsearch ID on the client. One more question: do you think it is better to populate both _id and tie_breaker_id with the same generated value, or to generate only tie_breaker_id and use it, leaving _id alone? I don't want to cause any problems by creating my own primary IDs.
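
One possible way to generate such an ID on the client, sketched in Python: derive it from the RDBMS ID plus the geopoint, so that each (post, point) pair always maps to the same value. This is only an illustration under assumptions and was not suggested in the thread; the function name `derive_doc_id`, the key layout, and the use of SHA-1 are all hypothetical choices.

```python
import hashlib


def derive_doc_id(post_id, lat, lon):
    """Derive a stable ID for one (post, geopoint) document."""
    # Fixed-precision formatting keeps the key identical across runs
    # for the same coordinates.
    key = f"{post_id}:{lat:.6f}:{lon:.6f}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Hypothetical usage: two documents for post 42, differing only in the point.
print(derive_doc_id(42, 52.520008, 13.404954))
print(derive_doc_id(42, 48.856613, 2.352222))
```

A derived value like this could fill tie_breaker_id alone or both _id and tie_breaker_id; with a deterministic _id, re-indexing the same pair overwrites the existing document rather than creating a new one.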
