I'm in the process of migrating a complete index from one cluster to another.
I'm selecting all the documents from the old cluster with the scroll API and bulk indexing them into the new cluster, using elasticsearch-php.
The issue I'm currently having is that while "scrolling" through the documents on the old cluster, I get a JsonDeserializationError exception (PHP) with the message "Syntax error".
It took me some time to figure out the root cause of this exception, but apparently the old cluster contains documents whose content is not valid JSON.
A simplified form of such a document is:
{ { /* first json object */ } { /* second json object */ } { /* second json object */ } }
(Note that there is no comma (,) between the inner objects.)
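For illustration, a string of this shape is also rejected by PHP's own `json_decode`. This is a minimal sketch: `isValidJson` is a hypothetical helper name and the sample strings are made up, they are not taken from the real index.

```php
<?php
// Hypothetical helper: true when $raw decodes as JSON without an error.
function isValidJson(string $raw): bool
{
    json_decode($raw);
    return json_last_error() === JSON_ERROR_NONE;
}

// Made-up sample with the same shape as the broken documents:
// sibling objects inside an outer object, with no keys and no commas.
$broken = '{ {"a": 1} {"b": 2} {"c": 3} }';
$valid  = '{"first": {"a": 1}, "second": {"b": 2}}';

var_dump(isValidJson($broken)); // bool(false)
var_dump(isValidJson($valid));  // bool(true)
```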
I have no idea how this kind of invalid document could ever have been indexed in ES. I have several of these documents, all dating from January 2015.
This particular index has survived a few Elasticsearch upgrades. I am not sure which version of ES I was using when these documents were indexed, but judging by the release dates it cannot have been anything later than 1.4.2.
I was really surprised that this could ever happen, but it is probably related to a bug in an older version.
Anyway, this makes moving my index to the new cluster quite troublesome:
- The elasticsearch-php library tries to parse the ES response and throws an exception
- I could catch the exception, but there is no easy way to know which document caused this
- I could set the page size to 1 document and simply ignore the exception, but that does not seem very efficient in terms of resources
Is there a quick way to know which documents contain invalid json?
Or is there some kind of fsck-like process to detect this kind of document?
I am currently making a PHP script for selecting all the items, but doing so without using the elasticsearch-php library, because the library does not provide an option to just return the document as a string.
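A minimal sketch of that approach, assuming the document IDs can be collected separately (for example with a scroll that does not return `_source`) and that each document is then fetched as a raw string from the `_source` endpoint of the old cluster. The names `fetchRawSource` and `findInvalidDocuments`, as well as the host, index and type values in the usage comment, are placeholders of mine and not part of elasticsearch-php. I have also not verified that the endpoint returns the stored bytes unchanged for the broken documents.

```php
<?php
// Hypothetical fetcher: returns the raw _source body of one document as a
// string, or null when the request fails or does not answer with HTTP 200.
// Uses the curl extension directly, so nothing parses the body on the way.
function fetchRawSource(string $baseUrl, string $index, string $type, string $id): ?string
{
    $url = sprintf(
        '%s/%s/%s/%s/_source',
        rtrim($baseUrl, '/'),
        rawurlencode($index),
        rawurlencode($type),
        rawurlencode($id)
    );
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($body === false || $status !== 200) ? null : $body;
}

// Returns the IDs whose raw body could not be fetched or is not valid JSON.
// $fetch is any callable that maps an ID to a raw string (or null on failure).
function findInvalidDocuments(array $ids, callable $fetch): array
{
    $invalid = [];
    foreach ($ids as $id) {
        $raw = $fetch($id);
        if ($raw === null) {
            $invalid[] = $id;
            continue;
        }
        json_decode($raw);
        if (json_last_error() !== JSON_ERROR_NONE) {
            $invalid[] = $id;
        }
    }
    return $invalid;
}

// Usage with placeholder values (assumptions, adjust to the real cluster):
// $fetch = function ($id) {
//     return fetchRawSource('http://localhost:9200', 'my_index', 'my_type', (string) $id);
// };
// $badIds = findInvalidDocuments($allIds, $fetch);
```

This costs one request per document, so it is only meant as a one-off check. The resulting ID list could then be used to skip or repair those documents during the migration.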