bulk indexing and TransportSerializationException

We are using ES to index larger documents, typically PDF, DOCX, etc.
The average document is typically less than 1 MB. Our clients vary,
but typically we will need to index about 100-200 GB of this type of
data, and after that everything is real time.

I'm running ES embedded, and once I kick off a reindex, all the nodes
participate in digesting the binary content into a string format that
gets fed to ES. The bulk of the work is really in the digesting on
our side, but I'm wondering if I should look into using the bulk API.
After the initial setup we really won't be doing much bulk loading,
but based on my timings the initial load may take 6 hours or more, so
any speedup would be great.
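
For reference, here is roughly what I'm picturing for the bulk
approach (just a sketch against the Java client; the type name
"content", the BulkLoader class, and the batch size are placeholders,
not our real config):

// Sketch of pushing the digested documents through the bulk API.
// The type name "content" and BATCH_SIZE are placeholders.
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

import java.util.List;

public class BulkLoader {

    private static final int BATCH_SIZE = 500; // tune for ~1 MB docs

    public static void indexAll(Client client, List<String> digestedJsonDocs) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (String json : digestedJsonDocs) {
            bulk.add(client.prepareIndex("sakai_index", "content").setSource(json));
            if (bulk.numberOfActions() >= BATCH_SIZE) {
                execute(bulk);
                bulk = client.prepareBulk();
            }
        }
        if (bulk.numberOfActions() > 0) {
            execute(bulk);
        }
    }

    private static void execute(BulkRequestBuilder bulk) {
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            // inspect/retry failed items instead of failing the whole run
            System.err.println(response.buildFailureMessage());
        }
    }
}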

Also, I was seeing a bunch of errors like these:

2013-02-18 12:08:43,443 WARN
elasticsearch[server2_ec2-204-236-163-41][generic][T#1043]
org.elasticsearch.cluster.action.shard - [server2_ec2-204-236-163-41]
sending failed shard for [sakai_index][1],
node[dmWqaCe0S4adiEAD_043qA], [R], s[INITIALIZING], reason [Failed to
start shard, message [RecoveryFailedException[[sakai_index][1]:
Recovery failed from
[server4_ec2-50-18-148-126][BQx-NOWeRG2uxTz0v2xi1w][inet[/10.171.159.235:9300]]{local=false}
into [server2_ec2-204-236-163-41][dmWqaCe0S4adiEAD_043qA][inet[/204.236.163.41:9300]]{local=false}];
nested: RemoteTransportException[Failed to deserialize exception
response from stream]; nested: TransportSerializationException[Failed
to deserialize exception response from stream]; nested:
InvalidClassException[failed to read class descriptor]; nested:
ClassNotFoundException[org.elasticsearch.transport.RemoteTransportException];
]]

2013-02-28 14:17:43,742 WARN
elasticsearch[server4_ec2-50-18-148-126][transport_client_worker][T#4]{New
I/O worker #4} org.elasticsearch.transport.netty -
[server4_ec2-50-18-148-126] Message not fully read (response) for
[17684] handler
future(org.elasticsearch.indices.recovery.RecoveryTarget$4@53a60164),
error [true], resetting

Googling these suggests a version mismatch. I've double-checked that,
and that's not the problem. I saw one issue in 0.20.5 that looks like
it might be related to this; I upgraded and I'm still having the
issue.

I was doing a bunch of refresh and flush calls during my indexing;
from the research I've done, I gather it's best to just let ES do
that on its own. So I removed those calls and set these index
properties:

"translog.flush_threshold_period" : "5s",
"refresh_interval" : "5s",

Those problems went away for a little while but now are back again.
Would manual refreshes cause that? I'm wondering if I was simply
causing so many merges that things were essentially stepping on each
other. Any ideas what might cause this?

--
John Bush
602-490-0470


Do you use plugins, and the same plugin versions on all nodes? Also
on the (Transport)Client?
Do you mix Elasticsearch versions?
Are you sure you run the same JVM version on all nodes in the
cluster, and also on the (Transport)Client?

Explanation: "TransportSerializationException[Failed to deserialize
exception response from stream]; nested: InvalidClassException[failed to
read class descriptor]; nested:
ClassNotFoundException[org.elasticsearch.transport.RemoteTransportException"
is logged if you have nodes in the cluster that fail to read encoded
Java classes on the wire.

Possible reasons:

  • Elasticsearch version mismatch between cluster nodes; if
    exception classes have been refactored between versions, this
    results in fatal messages
  • missing plugin code on a node; when plugins throw custom
    exceptions, they can't get transported to a node where the plugin
    is not installed
  • JVM versions that are incompatible with each other; for example,
    mixing Java 6 and 7 JVMs will not work when classes are
    transported in the object input stream used on the Netty layer
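
A quick way to rule out the first and third points is to print, on
every node and on the client JVM, the versions the process is
actually running with, for example something like:

// Minimal check: print the Elasticsearch build and JVM version of this
// process so the values can be compared across all nodes and the client.
import org.elasticsearch.Version;

public class VersionCheck {

    public static void main(String[] args) {
        System.out.println("Elasticsearch: " + Version.CURRENT);
        System.out.println("JVM: " + System.getProperty("java.vm.vendor")
                + " " + System.getProperty("java.version"));
    }
}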

Flush/refresh actions do not hurt that much and should not throw
exceptions, although 5s is a little short in my understanding.

Jörg


All the nodes are using an NFS share of the exact same Elasticsearch
jars. I'm not sure about plugins, but since they're all pointing to
that same stuff, that's not it. These 4 nodes were cloned from the
same instance in AWS, same Java version everywhere. To me it seems
like maybe there's some system/network thing causing problems. Maybe
a node is getting partial data and that's why the serialization
errors? It seems to happen consistently after about 15-20 minutes of
indexing.


In the case of incomplete network buffers, the error would be an
IOException in Netty, something like "connection reset by peer". I
think ES can't handle one of your input documents and creates an
internal exception, which may not be serializable; this breaks the
Netty transport and is reported back to your bulk indexing.
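
If that is the case, logging the per-item failures on the bulk
response should point to the offending document, roughly like this
(a sketch; the getter names are from the Java client and may differ
slightly in your version):

// Rough sketch: log which documents in a bulk request failed so a
// problematic source document can be identified and inspected.
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkResponse;

public class BulkFailureLogger {

    public static void logFailures(BulkResponse response) {
        if (!response.hasFailures()) {
            return;
        }
        for (BulkItemResponse item : response) {
            if (item.isFailed()) {
                System.err.println("Failed to index id=" + item.getId()
                        + ": " + item.getFailureMessage());
            }
        }
    }
}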

Jörg

On 28.02.13 23:25, John Bush wrote:

Maybe a node is getting partial data and that's why the serialization
errors? It seems to happen consistently after about 15-20 minutes of
indexing.
