ElasticSearch couchdb river not processing all documents


(wittmam-2) #1

I have 7 indices on ES 0.19.2 (also tried on 0.19.4) that use the
CouchDB river against CouchDB 1.2. Six of the seven indices load fine,
but one consistently stops at roughly the same document. CouchDB has
600k documents but only 35k are being processed. However, when I poll
CouchDB for changes, changes for all documents are returned.

I see in ES river the last sequence ID is 34500 and if I run:

curl -X GET 'http://localhost:5984/reviews/_changes?since=34500'

I see all the remaining change events returned. I also do not see any
errors in the logs even after turning on "river.couchdb: TRACE" in the
logging.yml file.
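
To look at just the first change after the river's last sequence, rather than the whole remaining feed, something like the following Python sketch might help. It assumes the same host and database as the curl command above and relies on CouchDB's `since`, `limit` and `include_docs` query parameters; the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def changes_url(base_url, database, since, limit=1, include_docs=True):
    """Build a _changes URL that starts right after the given sequence."""
    query = urlencode({
        "since": since,
        "limit": limit,
        "include_docs": "true" if include_docs else "false",
    })
    return f"{base_url.rstrip('/')}/{database}/_changes?{query}"


def fetch_next_changes(base_url, database, since, limit=1):
    """Return the next change rows (with documents) after `since`."""
    with urlopen(changes_url(base_url, database, since, limit)) as response:
        return json.load(response)["results"]


if __name__ == "__main__":
    # Assumed values, taken from the curl command in this post.
    for row in fetch_next_changes("http://localhost:5984", "reviews", 34500):
        print(row["seq"], row["id"])
        print(json.dumps(row.get("doc"), indent=2))
```

The first row returned is the most likely candidate for the document the river stops on.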

I am not able to see any difference between a document that was loaded into
ES and one that was not.

Also, I am not seeing a heartbeat for this river but I do see the other
rivers.

Here is the gist showing how I created the index, mapping and river:
git://gist.github.com/3076793.git

Any suggestions or guidance would be appreciated!


(David Pilato) #2

Sometimes it fails when there is a special character in the CouchDB document (check the UTF-8 encoding of this document).

HTH
David :wink:
Twitter : @dadoonet / @elasticsearchfr


(wittmam-2) #3

David,

It turns out the document it failed on contained a value that should be a
string but was being sent as a 21-digit number in the JSON, which exceeds
the roughly 15 significant digits that a JavaScript number can represent
exactly. The mapping specified when creating the index would have cast it
to a string, but because the river failed to process the document, the
mapping was never applied. I would hope the river would fail just this one
doc and discard it, similar to a mapping error in ES, rather than fail
completely and stop loading additional documents. When the river is
recreated, it fails on the same doc again, so there is no recovery unless
the document is removed or fixed.

If anyone has a good way to gracefully handle this situation I am all ears.
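
In the meantime, one possible workaround is to find and fix such documents on the CouchDB side before the river reaches them. The Python sketch below is only an illustration, not part of the river: it uses the standard `json` module's `parse_int` hook to flag integer literals longer than 15 digits (the threshold mentioned above, treated here as an assumption) and can re-serialize them as strings. All function and class names are hypothetical.

```python
import json

MAX_SAFE_DIGITS = 15  # assumption: threshold taken from the description above


class OversizedInt(str):
    """Marker type that keeps the original digits of an overly long integer."""


def _parse_int(literal):
    # json.loads calls this hook with the raw text of every JSON integer.
    if len(literal.lstrip("-")) > MAX_SAFE_DIGITS:
        return OversizedInt(literal)
    return int(literal)


def find_oversized_ints(raw_json):
    """Return (path, literal) pairs for each oversized integer in the document."""
    found = []

    def walk(node, path):
        if isinstance(node, OversizedInt):
            found.append((path, str(node)))
        elif isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}[{index}]")

    walk(json.loads(raw_json, parse_int=_parse_int), "")
    return found


def stringify_oversized_ints(raw_json):
    """Re-serialize the document with oversized integers written as strings."""
    # OversizedInt subclasses str, so json.dumps emits it as a quoted string.
    return json.dumps(json.loads(raw_json, parse_int=_parse_int))


if __name__ == "__main__":
    sample = '{"_id": "a", "upc": 123456789012345678901, "n": 5}'
    print(find_oversized_ints(sample))
    print(stringify_oversized_ints(sample))
```

Running this over the raw JSON of each row in the `_changes` feed would point to the offending field, and the rewritten document could then be saved back to CouchDB so the river can get past it.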

