Is there a way for me to assign the _id that is generated when I create a document in Elasticsearch directly to a field in that document while it's being created?
I am currently trying to find the max _id of the documents in an index and increment it by 1 before I create the next document.
But so far I have not found any way to fetch the max _id (or _uid) of the documents in my index, so that idea is not working out.
Can someone please suggest how I can do either one of these?
It's not 100% clear from your post whether you are using auto-generated IDs (i.e. you are not specifying the id when you create the document), though I suspect you are not. But if you are, the auto-generated IDs are not numeric, so getting the max id would not make sense.
To find the max value of a field you can generally use the max aggregation. However, the max aggregation only works on numeric fields, and the _id in Elasticsearch is interpreted and stored as a String (this is a bit of a lie, as it's actually stored in a more specialised way, but a String or keyword type is the closest match for the purposes of this explanation). So you can't do a max aggregation on the _id.
However, if you are setting the _id yourself, you can always add your own my_id field, mapped as a long type, and use that to find the max id. You would then rely on your client application to always send documents where the _id matches the my_id.
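The max aggregation on such a field could look roughly like the following. This is only a minimal sketch: it builds the request bodies as plain dictionaries and reads a response of the usual aggregation shape, without talking to a cluster. The helper names (build_mapping, build_max_id_query, next_id_from_response) and the aggregation name max_my_id are made up for illustration, and the mapping assumes the typeless format used in Elasticsearch 7 and later.

```python
def build_mapping(field="my_id"):
    """Index body that maps the numeric id field as a long.

    Assumption: typeless mappings (Elasticsearch 7+). Older versions
    nest "properties" under a mapping type name.
    """
    return {"mappings": {"properties": {field: {"type": "long"}}}}


def build_max_id_query(field="my_id"):
    """Search body asking only for the max of the numeric id field."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {"max_my_id": {"max": {"field": field}}},
    }


def next_id_from_response(response):
    """Derive the next id from a search response.

    The max aggregation reports its value as a floating-point number,
    or null (None) when no document has the field.
    """
    value = response["aggregations"]["max_my_id"]["value"]
    return 1 if value is None else int(value) + 1


# Hand-written sample responses, not output captured from a real cluster.
sample_response = {"aggregations": {"max_my_id": {"value": 41.0}}}
empty_response = {"aggregations": {"max_my_id": {"value": None}}}

print(next_id_from_response(sample_response))
print(next_id_from_response(empty_response))
```

Because the value comes back as a floating-point number, very large ids could lose precision, so this is probably only suitable for ids of moderate size.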
Why do you want to increment the id before you index the next document? Is it to simulate a sequential, counting id? If so, there are some things to consider:
Making a request to find the max id requires a call to Elasticsearch and for Elasticsearch to scan through all id values for all documents in your index. Doing this on every index request will impact indexing performance, and it generates a lot of extra requests to Elasticsearch, which may impact the performance of your other queries too. If your application is the only application indexing documents into this index, I would suggest that you keep a counter of the last doc id in your client application and increment it from there.
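A client-side counter of that kind might be sketched as below. The class name SequentialIdCounter and the idea of seeding it once at startup (for example from a single max lookup) are assumptions for illustration, not something Elasticsearch provides. The counter lives in memory only, so it would need to be re-seeded whenever the application restarts.

```python
import threading


class SequentialIdCounter:
    """Hands out consecutive ids inside one client process."""

    def __init__(self, last_id=0):
        # last_id is the highest id already used in the index.
        self._last_id = last_id
        self._lock = threading.Lock()

    def next_id(self):
        # The lock keeps ids unique if several threads index documents.
        with self._lock:
            self._last_id += 1
            return self._last_id


# Seed once (41 here is just an example value), then increment locally.
counter = SequentialIdCounter(last_id=41)
first = counter.next_id()
second = counter.next_id()
print(first, second)
```

This only works while there is exactly one such counter for the index; two processes each holding their own counter would hand out the same ids.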
If you want to do this because you might have many client applications indexing documents independently then, in addition to the point above, you could run into concurrency problems: client A gets the max id and goes to write the next document with the incremented id, but before client A writes that new document, client B also asks for the max id and gets the old value, meaning that both clients write to the same id and one document overwrites the other. The only way to reliably avoid this is either to use auto-generated IDs (i.e. let Elasticsearch pick the id) or to use a field (or combination of fields) in your document which is unique across all the documents you are going to write.
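The race between two clients can be illustrated with a small simulation. This is only a sketch: a plain dictionary stands in for the index, and read_max_id, index_doc and simulate_race are hypothetical helpers, not Elasticsearch calls. The two clients' steps are interleaved by hand to show the problematic ordering.

```python
def read_max_id(store):
    """Stand-in for asking the index for its current max id."""
    return max(store, default=0)


def index_doc(store, doc_id, doc):
    """Stand-in for an index request: an existing id is overwritten."""
    store[doc_id] = doc


def simulate_race():
    store = {1: {"author": "seed"}}

    # Both clients read the max id before either has written.
    max_seen_by_a = read_max_id(store)
    max_seen_by_b = read_max_id(store)

    # Both compute the same "next" id and write to it.
    index_doc(store, max_seen_by_a + 1, {"author": "A"})
    index_doc(store, max_seen_by_b + 1, {"author": "B"})
    return store


result = simulate_race()
# Two writes were sent, but only one new document exists.
print(len(result), result[2]["author"])
```

Here client A's document is silently replaced by client B's, which is the lost write described above.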
That was a really clear answer to my question. Apologies for not being clearer in my question.
I have only a single point that creates documents; this is enforced in the application using it, to avoid concurrency issues. The reason I need the max id is that I need the id to be sequential, as the order of creation is critical for the application. While testing my application I realized that the _id is not a number I can run a max aggregation on, hence I asked how I can assign a value to a field at creation time.
So I think I will have to take the first approach you suggested and create the id in my application. That means two requests have to go to the ES instance, and I have to take care of any concurrency issues that causes.
As a follow-up, I would like to ask:
Can I ensure that a document is not created if a document with the given id already exists (whether through create or update calls)?