Issue when indexing to Elasticsearch from Apache Nutch


(Sachin Shaju) #1

I was trying to index from Apache Nutch into a single-node ES cluster and got this error:

org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:173)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:125)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.StreamCorruptedException: Unsupported version: 1
    at org.elasticsearch.common.io.ThrowableObjectInputStream.readStreamHeader(ThrowableObjectInputStream.java:46)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
    at org.elasticsearch.common.io.ThrowableObjectInputStream.<init>(ThrowableObjectInputStream.java:38)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:170)
    ... 23 more

From further research, I learned that the client and the ES server should run the same JVM version. Reference: http://jontai.me/blog/2013/06/elasticsearch-remotetransportexception-failed-to-deserialize-exception-response-from-stream/

I'm using ES version 2.3.2 and my JVM version is "1.8.0_91". When I checked /plugins/indexer-elastic/plugin.xml, the Elasticsearch version specified there is 1.4.1. I would like to know whether this could be the issue, and whether there is a solution other than downgrading the ES cluster. I would like to continue with ES 2.3.2. Please help me with this.

PS: The command I used for indexing is bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20160801174223/
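A quick way to sanity-check the suspected mismatch is to compare the client library version from plugin.xml with the version the cluster reports. The sketch below is only a rough heuristic: it assumes that matching major versions is the minimum requirement for the transport client, and the function names are made up for illustration. The `server_version` helper reads `version.number` from the root endpoint on port 9200 and is not called in the example, so the snippet runs without a cluster.

```python
import json
from urllib.request import urlopen


def major(version: str) -> int:
    """Return the major component of a dotted version string such as '2.3.2'."""
    return int(version.split(".")[0])


def versions_compatible(client_version: str, server_version: str) -> bool:
    # Assumption: equal major versions is the minimum bar for compatibility.
    # Exact requirements depend on the client type and the ES release.
    return major(client_version) == major(server_version)


def server_version(base_url: str = "http://localhost:9200") -> str:
    """Read version.number from the cluster's root endpoint (needs a running node)."""
    with urlopen(base_url) as resp:
        return json.load(resp)["version"]["number"]


if __name__ == "__main__":
    client = "1.4.1"  # version found in plugins/indexer-elastic/plugin.xml
    server = "2.3.2"  # or server_version() against a live cluster
    print(versions_compatible(client, server))  # False
```

With the values from this post the check reports a mismatch, which is consistent with the deserialization failure in the stack trace.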


(Jörg Prante) #2

This is a question for the Nutch community.

You have to build Nutch from source (master branch). It supports ES 2.3.3: https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic/plugin.xml


(Sachin Shaju) #3

I asked the question here because the exception was specific to Elasticsearch. Thanks for your reply :slight_smile:


(Sachin Shaju) #4

It worked. Thanks @jprante :slight_smile:


(Nestor A Florez) #5

Hi Sachin,
Do you have any info on how you made it work? I am trying to hook up Nutch 1.12 and Elasticsearch 2.4. My website is crawled, and I edited nutch-site.xml. I can see info on port 9200. I just do not know how to see the data or how to configure which fields to display. Any examples?

Thanks,

Nestor
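One way to look at the indexed data is the search endpoint on port 9200. The sketch below is a minimal illustration, not a Nutch-specific recipe: the index name `nutch` and the field names `title` and `url` are assumptions, so check nutch-site.xml and the actual documents for the real ones. `search_url` builds a URI search request and `pick_fields` keeps only the chosen fields from each hit; the sample response stands in for a real one so the snippet runs without a cluster.

```python
from urllib.parse import urlencode


def search_url(base_url, index, query="*:*", size=5):
    """Build a URI search request against the given index."""
    params = urlencode({"q": query, "size": size})
    return f"{base_url}/{index}/_search?{params}"


def pick_fields(response, fields):
    """Keep only the chosen fields from each hit's _source."""
    return [
        {name: hit["_source"].get(name) for name in fields}
        for hit in response["hits"]["hits"]
    ]


if __name__ == "__main__":
    # Assumed index name; the real one is whatever Nutch was configured with.
    print(search_url("http://localhost:9200", "nutch"))

    # Hand-written sample shaped like a search response, for illustration only.
    sample = {
        "hits": {
            "hits": [
                {"_source": {"title": "Home", "url": "http://example.com/", "content": "..."}},
            ]
        }
    }
    for doc in pick_fields(sample, ["title", "url"]):
        print(doc)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns the JSON that `pick_fields` expects.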


(Sachin Shaju) #6

Have you tried the crawl script in Nutch, e.g. bin/crawl -i urls/ CrawlDir/ 1, to crawl and index a site?

