We are using ES 6.4 to index a cache whose values are XML documents. The key of this cache is generated by capturing values from the XML document, so it's easy to end up with IDs longer than 512 bytes.
When we perform an index request, the _id must be this key, and since it's longer than 512 bytes, the bulk request fails.
So, is it possible (via configuration or in any other way) to remove this limitation?
I wasn't even aware of a length restriction on the _id field. There's nothing about it in the docs, so I'm not sure whether it can be removed or changed.
Do you have to use that value as the actual document ID? If not, you could let Elasticsearch generate an _id value for you (by not specifying one yourself), then store your own ID value in a different field that's mapped as a keyword type. It would have implications for how you access the documents, but those may not affect you.
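A related option that keeps the document ID deterministic is to hash the long cache key down to a fixed-length _id and keep the original key in the separate keyword field. A minimal sketch, assuming SHA-256 is acceptable for your key space (the class and method names are illustrative, not part of any Elasticsearch API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class IdHasher {
    // Hash an arbitrarily long cache key down to a 64-character hex string,
    // well under the 512-byte _id limit. The same key always yields the
    // same _id, so later delete-by-id requests can be built from the key alone.
    public static String hashId(String cacheKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(cacheKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

You would index with this hash as the _id and put the full cache key in a keyword-mapped field so it remains searchable.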
Our development is external; I simplified the question here, but it's a bit more complicated. This cache belongs to a bank ... The hash you suggest implies work not only on our side but on theirs... I don't think they are going to change the cache design just because I tell them ES does not support IDs longer than 512 bytes.
Instead, why not add a new config parameter, something like "index.max_id_length=xxxx", set to 512 by default?
Thanks.
And when you validate the ID length, instead of "id_length > 512 then error", check "id_length > index.max_id_length then error".
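As a sketch of what that proposed check might look like, here is the limit pulled from a configurable value instead of a hard-coded constant. Note that "index.max_id_length" is a hypothetical setting this post is proposing; Elasticsearch does not expose it, and all names below are illustrative:

```java
import java.nio.charset.StandardCharsets;

public class IdValidator {
    // Hypothetical: the limit would come from an index setting
    // (e.g. "index.max_id_length") instead of a hard-coded 512.
    private final int maxIdLength;

    public IdValidator(int maxIdLength) {
        this.maxIdLength = maxIdLength;
    }

    // Validate the UTF-8 byte length of a document ID against the
    // configured limit, rejecting IDs that exceed it.
    public void validate(String id) {
        int lengthInBytes = id.getBytes(StandardCharsets.UTF_8).length;
        if (lengthInBytes > maxIdLength) {
            throw new IllegalArgumentException(
                "id is too long, must be no longer than " + maxIdLength
                    + " bytes but was: " + lengthInBytes);
        }
    }
}
```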
We need to remove documents using the _id. In the remove method we only receive the cacheId and the cache object. If we used an autogenerated ID when inserting documents, we would need to store it in the cache object, and that's not possible because we don't control that object. So my only option is to add an intermediate ConcurrentMap where the key is the cacheId and the value is the autogenerated ID. Yes, I can do that.
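That intermediate map could be as simple as the following sketch (class and method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Maps the cache's own key to the Elasticsearch-generated _id, so that
// documents can still be deleted when only the cacheId is available.
public class IdRegistry {
    private final ConcurrentMap<String, String> cacheIdToEsId =
        new ConcurrentHashMap<>();

    // Record the _id returned in the index response for this cacheId.
    public void register(String cacheId, String esGeneratedId) {
        cacheIdToEsId.put(cacheId, esGeneratedId);
    }

    // Look up and drop the mapping when the cache entry is removed;
    // returns null if this cacheId was never indexed.
    public String removeMapping(String cacheId) {
        return cacheIdToEsId.remove(cacheId);
    }
}
```

One caveat: an in-memory map doesn't survive restarts, so you would either need to persist it or fall back to a term query on the keyword field holding the cache key to recover the generated _id.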
But my question is: is this 512-byte limitation a technical one? If you say it is technically impossible to have IDs longer than 512 bytes, then OK, that's the end of the story.
But if this limitation exists because at some point in the past someone thought that such long IDs would be awful in terms of performance, it would be great to leave that decision to the end user. If someone needs longer IDs, and that turns out to be a performance problem, it's their job to provision more cores, memory, or whatever they need.
I don't know what implications allowing these longer IDs (by config) would have for your code. Maybe it would be easy if this limit is only checked at validation time when indexing documents, but I don't know whether it's used for anything else.
It's a technical limitation in the sense that it's enforced in code. I don't know why it's there, though; you may want to raise an issue to seek further clarification.