Accessing _id in ingest pipeline

Oleg-Arkhipov · April 11, 2019, 9:23pm

The Search After documentation (as well as one or two other places in the docs) states that aggregating and sorting on the _id field is inefficient, and that it is better to duplicate _id into another ordinary field with doc_values enabled. The docs also suggest using an ingest pipeline:

Instead it is advised to duplicate (client side or with a set ingest processor)

However, I haven't found a way to do that myself, nor any online example showing how. It looks like the pipeline is executed before the autogenerated _id is assigned. The following processor:

{
  "set": {
    "field": "tie_breaker_id",
    "value": "{{_id}}"
  }
}

assigns an empty string to the tie_breaker_id field.

spinscale · April 12, 2019, 6:19pm

Hey,

Indeed, the id generation happens after the ingest pipeline is applied. I opened an issue at https://github.com/elastic/elasticsearch/issues/41163

--Alex

Oleg-Arkhipov · April 12, 2019, 6:31pm

Thanks for the reply. What would you suggest for implementing the desired _id field duplication? Only client-side processing (2 requests)?

spinscale · April 12, 2019, 6:34pm

May I ask for more information about the use case here? If it is logging, I wonder whether search after is needed; if it is something else, I am curious to learn more.

That said, client-side id generation and configuring an additional field would indeed work.
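For readers who want to see what client-side id generation with an additional field could look like, here is a minimal sketch. It only builds the two NDJSON lines of a bulk `index` action and does not talk to a cluster. The index name `posts`, the helper name `build_index_action`, and the choice of a random UUID as the id are all assumptions made for illustration; the thread does not prescribe any of them.

```python
import json
import uuid


def build_index_action(index, source):
    """Return the two NDJSON lines of a bulk `index` action for one document.

    The id is generated on the client and written twice: once as the
    document `_id` and once into the ordinary `tie_breaker_id` field.
    """
    # Assumption: a random UUID is an acceptable id for this data.
    doc_id = uuid.uuid4().hex
    # Copy the source and add the duplicate id field (hypothetical helper logic).
    body = dict(source, tie_breaker_id=doc_id)
    action = {"index": {"_index": index, "_id": doc_id}}
    return json.dumps(action), json.dumps(body)


# "posts" is a hypothetical index name.
action_line, source_line = build_index_action("posts", {"title": "Hello"})
print(action_line)
print(source_line)
```

Because the id is known before the request is sent, a single indexing request is enough; no second request is needed to copy the id afterwards. The `tie_breaker_id` field would presumably be mapped as a keyword field so that doc_values are available for sorting.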

Oleg-Arkhipov · April 12, 2019, 8:07pm

It is not logging; it is storing user-generated resources (blog posts, for example) with various attributes, for powerful and fast search using Elasticsearch. I am using search_after for pagination (with ["_score", "_id"] as the sort parameters) because, as I understand it, it is the optimal (if not the only) way to do traditional real-time user pagination.
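As an illustration of the pagination described here, the sketch below builds a search request body that sorts on `_score` and then on a duplicate id field instead of `_id`. It assumes the documents carry a `tie_breaker_id` field with doc_values enabled; the helper name `build_page_request`, the sort directions, and the sample values are hypothetical.

```python
def build_page_request(query, size, last_sort_values=None):
    """Build a search body for search_after pagination.

    `last_sort_values` is the `sort` array of the last hit on the
    previous page, or None when requesting the first page.
    """
    body = {
        "size": size,
        "query": query,
        # Sort by relevance first, then by the duplicate id as tie breaker.
        "sort": [
            {"_score": "desc"},
            {"tie_breaker_id": "asc"},
        ],
    }
    if last_sort_values is not None:
        body["search_after"] = list(last_sort_values)
    return body


query = {"match": {"title": "elasticsearch"}}
first_page = build_page_request(query, size=10)
# Made-up sort values standing in for the last hit of the first page.
next_page = build_page_request(query, size=10, last_sort_values=[1.25, "a1b2c3"])
print(next_page)
```

The values passed as `last_sort_values` must be in the same order as the `sort` clauses, which is why the tie breaker has to be unique per document.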

spinscale · April 12, 2019, 8:12pm

If it is a blog post, maybe using the URL (or its slug) as the id would simplify things and also allow for stable lookups (and that field can also be part of the document itself), since that is another part of the data that should be unique?

Oleg-Arkhipov · April 12, 2019, 8:23pm

Of course, I also have an id field with the autogenerated ID from my primary RDBMS. But I have a collection where each "blog post" may have multiple documents, which are duplicates except for one field with a geopoint (this structure is used for aggregation with filters to display points on a map), so any otherwise unique attributes of a "blog post" are not unique here. I guess it will therefore be easier to generate the Elasticsearch id on the client. One more question: do you think it is better to populate both _id and tie_breaker_id with the same generated value, or to generate only tie_breaker_id and use it, leaving _id alone? I don't want to cause any problems by creating my own primary ids.

system · May 10, 2019, 8:24pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
