The Search After documentation (as well as one or two other places in the docs) states that aggregating and sorting on the _id field is inefficient, and that it is better to duplicate _id into another ordinary field with doc_values enabled. The docs also suggest using an ingest pipeline:
However, I haven't found a way to do that myself, nor any online example showing how. It looks like the pipeline is executed before the autogenerated _id is assigned. The following processor:
May I ask for more information about the use case here? If it is logging, I wonder whether search_after is needed; if it is something else, I am curious to learn more.
That said, generating the ID on the client side and configuring an additional field would indeed work.
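A minimal sketch of that client-side approach, assuming a hypothetical index named "posts" and a tie_breaker_id field mapped as keyword (which has doc_values enabled by default). The helper names are made up for illustration; the code only builds a request body for the Bulk API and does not talk to a cluster:

```python
import json
import uuid


def build_bulk_action(doc, index="posts", tie_breaker_field="tie_breaker_id"):
    """Return the (action, source) pair for one document.

    The same client-generated value is used as _id and copied into an
    ordinary field, so sorting can use that field instead of _id.
    """
    doc_id = uuid.uuid4().hex  # random 32-character hex ID generated on the client
    action = {"index": {"_index": index, "_id": doc_id}}
    source = dict(doc)  # copy so the caller's dict is not modified
    source[tie_breaker_field] = doc_id
    return action, source


def to_bulk_body(docs, index="posts"):
    """Serialize documents as newline-delimited JSON for the Bulk API."""
    lines = []
    for doc in docs:
        action, source = build_bulk_action(doc, index)
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string could be sent as the body of a _bulk request by whichever HTTP or Elasticsearch client is already in use.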
It is not logging; it is storing user-generated resources (blog posts, for example) with various attributes for powerful and fast search with Elasticsearch. I am using search_after for pagination (with ["_score", "_id"] as the sort parameters) because, as I understand it, that is the optimal (if not the only) way to do traditional real-time user pagination.
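For illustration, the pagination described here could be sketched roughly as below. It builds only the search request bodies, and it assumes the tie-breaker is a separate tie_breaker_id field (as the docs recommend) rather than _id itself; the function names are hypothetical:

```python
def build_page_query(query, page_size=10, search_after=None):
    """Build a search body sorted by relevance with a unique tie-breaker."""
    body = {
        "size": page_size,
        "query": query,
        # Best score first; ties are broken by the unique tie_breaker_id field.
        "sort": [{"_score": "desc"}, {"tie_breaker_id": "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(search_after)
    return body


def next_cursor(hits):
    """Return the sort values of the last hit, or None if the page is empty."""
    return hits[-1]["sort"] if hits else None
```

The first page is requested without search_after; each following page passes the cursor taken from the previous response's hits.hits array, and pagination stops when the cursor is None.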
If it is a blog post, maybe using the URL (or its slug) as the ID would simplify things and also allow for stable lookups (that field can also be part of the document itself), since that is another part of the data that should be unique?
Of course, I also have an id field with the autogenerated ID from my primary RDBMS. But I have a collection where each "blog post" may have multiple documents that are duplicates except for one geopoint field (this structure is used for aggregations with filters to display points on a map), so any otherwise unique attributes of a "blog post" are not unique here. I guess it will therefore be easier to generate the Elasticsearch ID on the client. One more question: do you think it is better to populate both _id and tie_breaker_id with the same generated value, or to generate only tie_breaker_id and use it, leaving _id alone? I don't want to cause any problems by creating my own primary IDs.