Elasticsearch index contains documents whose source is invalid JSON


(Stefan Stubbe) #1

I'm in the process of migrating a complete index from one cluster to another.
I'm selecting all the documents from the old cluster with the scroll API and bulk indexing them into the new cluster, using elasticsearch-php.

The issue I'm currently having is that while "scrolling" through the documents on the old cluster, I get a JsonDeserializationError (PHP) exception with the message "Syntax error".

It took me some time to figure out the root cause of this exception, but apparently the old cluster holds documents whose contents are not valid JSON.

A simplified version of such a document is:

  { { /* first json object */ } { /* second json object */ } { /* second json object */ } }

(Note that there is no comma (,) between the inner objects, and that the inner objects have no keys.)
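
For illustration, PHP's own json_decode rejects a string of this shape with the same "Syntax error" message. This is only a minimal sketch: the field names in $raw are made up for the example and are not taken from the real documents.

```php
<?php
// Hypothetical document body shaped like the broken ones: the inner
// objects have no keys and no commas between them.
$raw = '{ {"a": 1} {"b": 2} {"c": 3} }';

$decoded = json_decode($raw, true);

// json_decode returns null on failure; json_last_error() tells a parse
// failure apart from a document that is literally the JSON value null.
$isValid = json_last_error() === JSON_ERROR_NONE;
$message = json_last_error_msg();

echo $isValid ? "valid\n" : "invalid: {$message}\n";
```

The same check can be applied to any raw response body before it is handed to a decoder.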

I have no idea how this kind of invalid document could ever have been indexed in ES. I have several documents of this type, all dating from January 2015.

This particular index has survived a few Elasticsearch upgrades. I'm not sure which version of ES I was running when these documents were indexed, but according to the release dates it could not have been anything later than version 1.4.2.

I was really surprised that this could ever happen, but it is probably related to a bug in an older version.

Anyway, this makes moving my index to the new cluster quite troublesome:

  • The elasticsearch-php library tries to parse the ES response and throws an exception.
  • I could catch the exception, but there is no easy way to know which document caused it.
  • I could set the page size to 1 document and simply ignore the exception, but that does not seem very efficient on resources.

Is there a quick way to know which documents contain invalid json?
Or is there a kind of fsck process to detect these kinds of documents?

I am currently writing a PHP script that selects all the items without using the elasticsearch-php library, because the library does not provide an option to return the document as a raw string.
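
One possible way to pin down the broken documents without running the whole migration at a page size of 1 is to bisect: fetch a batch as a raw string, and only when it fails to decode, split it in half and retry each half. The sketch below is not part of elasticsearch-php. findInvalidIds and the $fetchRaw callable are hypothetical names, and the fake fetcher merely stands in for a real raw HTTP request. It also assumes the document IDs can be collected beforehand (for example with a scroll that does not return the source); whether that works on the affected cluster is untested.

```php
<?php
/**
 * Return the IDs whose raw response body does not decode as JSON.
 *
 * $fetchRaw is a hypothetical callable: it receives a list of IDs and
 * returns the undecoded response body for exactly those documents.
 */
function findInvalidIds(array $ids, callable $fetchRaw): array
{
    if ($ids === []) {
        return [];
    }

    json_decode($fetchRaw($ids));
    if (json_last_error() === JSON_ERROR_NONE) {
        return []; // the whole batch parses, nothing to report
    }
    if (count($ids) === 1) {
        return $ids; // narrowed down to a single broken document
    }

    // Split the failing batch in half and check each half on its own.
    $half = intdiv(count($ids), 2);
    return array_merge(
        findInvalidIds(array_slice($ids, 0, $half), $fetchRaw),
        findInvalidIds(array_slice($ids, $half), $fetchRaw)
    );
}

// Stand-in data for demonstration only: 'b' and 'e' have broken bodies.
$fakeStore = [
    'a' => '{"x": 1}',
    'b' => '{ {"x": 1} {"y": 2} }',
    'c' => '{"x": 3}',
    'd' => '{"x": 4}',
    'e' => '{ {"x": 5} {"y": 6} }',
];
$fakeFetch = function (array $ids) use ($fakeStore): string {
    $bodies = [];
    foreach ($ids as $id) {
        $bodies[] = $fakeStore[$id];
    }
    return '[' . implode(',', $bodies) . ']';
};

$invalid = findInvalidIds(array_keys($fakeStore), $fakeFetch);
print_r($invalid);
```

Batches that decode cleanly cost a single request, so the extra requests are spent only around the broken documents. Note that anything that makes the body undecodable, such as an empty response after a network error, would be reported as invalid too, so a real fetcher should handle transport errors separately.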


(Balu Giduturi) #2

Hi Stefan,

I faced the same problem. In my case the scroll ID seemed to malfunction; I don't know why.

I worked around it with exception handling (try/catch), as below, which worked for me:

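$docs = $_client->search($searchparams);
$scroll_id = $docs['_scroll_id'];
while (true) {
    try {
        $response = $_client->scroll(
            array(
                "scroll_id" => $scroll_id,
                "scroll" => "20m"
            )
        );
        $scroll_id = $response['_scroll_id'];

        // stop once a page comes back empty, otherwise the loop never ends
        if (empty($response['hits']['hits'])) {
            break;
        }

        // your stuff
    } catch (Exception $error) {
        // log the failing page and carry on with the next scroll request
        echo "scroll_id :::::" . $scroll_id . "   " . $error->getMessage();
    }
}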
