Document corruption in index, id field is garbled text

anurag_naidu · August 15, 2014, 1:12pm

We are using an ES 1.2.2 server with a Rails application as the client
(ActiveRecord document model), and it seems as though some of the documents
in the index might have been corrupted, because the *id* field of the
document is some garbled text like "JorMcjefSe2_VQkP_ntd8Q" when it's
supposed to be an integer value based on the mappings.

As an example, here is a document in the index with a corrupted id. Notice
the garbled document id, and that the id in the document's _source is null:

curl -XGET 'http://localhost:9200/production_restaurants/restaurant/Gu-NGnHtR3ef4V2z4NfNsQ?pretty'
{
  "_index" : "production_restaurants_20140714222814907",
  "_type" : "restaurant",
  "_id" : "Gu-NGnHtR3ef4V2z4NfNsQ",
  "_version" : 1,
  "found" : true,
  "_source":{"_id":null,"_type":"restaurant","title":"Wreck Bar and Grill","address":"Rum Point","phone":null,"location_hint":null,"popularity":0,"votes_percent":null,"price":null,"city":null,"state":"KY","zip":null,"city_id":375,"neighborhood_id":54892,"activity":null,"location":{"lat":19.371508,"lon":-81.271523},"closed":false,"neighborhood":{"title":"Grand Cayman","id":54892},"cuisines":[],"tags":[],"dishes":[],"restaurant_path":"http://www.urbanspoon.com"}
}

It seems like the corruption might be related to document deletion from the
index, because these indexed documents are no longer in our MySQL data, which
is the source for indexing documents in ES. Aside from finding what the
issue with the corruption might be, I am right now looking for a way to find
such bad documents in the index. I have had no luck with either a regexp query
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#query-dsl-regexp-query)
or the missing filter
(http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_dealing_with_null_values.html#_missing_filter)
applied to the id field. It's a strange situation: because *id* is of
type *integer* in my index mapping, I cannot apply a regexp query to it, and
I get a NumberFormatException from Lucene.

Any suggestions for a query that I could use to find such corrupted
documents and remove them ahead of time? Right now I've had to be very
reactive and remove these as I discover them in my Rails logs / error reports.
Before I consider a full reindex (which is heavy in and of itself), I would
like to explore what other options I have, including what might be the cause
of this corruption.

thanks

anurag

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8bd57820-6647-44f9-a089-2f22c2c83431%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rafal_Kuc_3 · August 15, 2014, 1:46pm

Hello!

Your document is not corrupted - during indexing the _id field was set to null - this is what _source shows. The _id you are seeing, the one that contains random characters, was simply generated by Elasticsearch, which is the default behavior if you don't specify the id. Let me give an example - try to index the following document:

$ curl -XPOST 'localhost:9200/test/doc/' -d '{
  "name" : "test",
  "_id" : null
}'

Now, if you search for all the documents in that test index, you will see the following:

$ curl -XGET 'localhost:9200/test/_search?pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "VVERnyl_TU6iBLT3ndnniA",
      "_score" : 1.0,
      "_source":{
        "name" : "test",
        "_id" : null
      }
    } ]
  }
}

As you can see, the document's _id is generated, while the _id field passed in _source is null.
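To make the distinction concrete on the client side, here is a minimal Python sketch (not something from the thread itself) that flags hits whose _id is not an integer or whose _source carries a null _id. The helper names `is_integer_id` and `suspicious_hits` are hypothetical, and it assumes that valid ids are plain non-negative decimal integers, as the integer mapping mentioned above suggests.

```python
import re

# Assumption: a valid id is a non-negative decimal integer (per the integer
# mapping described in the thread). Anything else is treated as auto-generated.
_INTEGER_ID = re.compile(r"[0-9]+")


def is_integer_id(doc_id):
    """Return True if doc_id consists only of decimal digits."""
    return _INTEGER_ID.fullmatch(doc_id) is not None


def suspicious_hits(hits):
    """Return hits whose _id is not an integer or whose _source._id is null."""
    bad = []
    for hit in hits:
        source = hit.get("_source") or {}
        source_id_is_null = "_id" in source and source["_id"] is None
        if not is_integer_id(str(hit["_id"])) or source_id_is_null:
            bad.append(hit)
    return bad


# Hits shaped like the search response above (JSON null becomes Python None).
sample_hits = [
    {"_index": "test", "_type": "doc", "_id": "VVERnyl_TU6iBLT3ndnniA",
     "_source": {"name": "test", "_id": None}},
    {"_index": "test", "_type": "doc", "_id": "42",
     "_source": {"name": "ok", "_id": 42}},
]

print([hit["_id"] for hit in suspicious_hits(sample_hits)])
# prints ['VVERnyl_TU6iBLT3ndnniA']
```

This only inspects hits that have already been fetched, so it does not need scripting enabled on the server, but it does mean paging through the index yourself.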

As for finding documents with _id set to null in the _source, you can try a script filter - something like this:

curl -XGET 'localhost:9200/test/_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "script" : {
          "script" : "_source.containsKey(\"_id\") && _source._id == null"
        }
      }
    }
  }
}'

You are using 1.2.2, where dynamic scripting is disabled by default, so you have to enable it for the above query to work and disable it again once it is no longer needed. Alternatively, put the script on the file system where Elasticsearch can see it, and refer to it by name.
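If it helps with the cleanup step, the sketch below (Python, and only an illustration) builds the same script-filter query as a dictionary and turns a list of matching hits into a request body for the bulk API, using one delete action per line. The function names `null_source_id_query` and `bulk_delete_body` are hypothetical; sending the requests, paging through results, and the index names are left to your own client code.

```python
import json


def null_source_id_query():
    """Build the script-filter query shown above as a Python dict."""
    script = '_source.containsKey("_id") && _source._id == null'
    return {"query": {"filtered": {"filter": {"script": {"script": script}}}}}


def bulk_delete_body(hits):
    """Build a newline-delimited bulk body with one delete action per hit."""
    lines = []
    for hit in hits:
        action = {"delete": {"_index": hit["_index"],
                             "_type": hit["_type"],
                             "_id": hit["_id"]}}
        lines.append(json.dumps(action))
    # The bulk API expects every line, including the last, to end with "\n".
    return "\n".join(lines) + "\n" if lines else ""


# Serialized search body; json.dumps takes care of escaping the inner quotes.
print(json.dumps(null_source_id_query()))

# Example with a hit shaped like the one in the search response above.
print(bulk_delete_body([
    {"_index": "test", "_type": "doc", "_id": "VVERnyl_TU6iBLT3ndnniA"},
]), end="")
```

It is probably worth reviewing the matched documents before posting the generated body to the _bulk endpoint, since the deletes cannot be undone.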

--

Regards,

Rafał Kuć

Performance Monitoring * Log Analytics * Search Analytics

Solr & Elasticsearch Support * http://sematext.com/

anurag_naidu · August 15, 2014, 3:38pm

Hi Rafal,

Thanks for the helpful insights and for pointing me in the right direction.
All of that makes sense. Now to investigate why ActiveRecord might be
triggering these documents to be indexed without an id.

anurag
