We are using ES to index larger documents, typically PDF, DOCX,
etc., averaging less than 1 MB each. Our clients vary, but typically
we need to index about 100-200 GB of this type of data up front, and
after that everything is real time. I'm running ES embedded, and once
I kick off a reindex all the nodes participate in digesting the binary
content into a string format that gets fed to ES. The bulk of the work
is really in the digesting on our side, but I'm wondering if maybe I
should look into using the bulk API. After the initial setup we really
won't be doing much bulk loading, but based on my timings the initial
load may take 6 hours or more, so any speedup would be great.
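For reference, this is the kind of _bulk payload I'd be building if I go that route — just a sketch of my understanding of the bulk format (newline-delimited JSON, one action line plus one source line per doc); the index name, type, and field names are placeholders from my setup:

```python
import json

def build_bulk_payload(docs, index="sakai_index", doc_type="content"):
    """Build a newline-delimited JSON body for the ES _bulk endpoint.

    Each document becomes two lines: an action/metadata line and a
    source line. The whole body must end with a trailing newline.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

# Two digested documents ready to POST to /_bulk in one request
payload = build_bulk_payload([
    ("1", {"title": "syllabus.pdf", "content": "digested text ..."}),
    ("2", {"title": "notes.docx", "content": "more digested text ..."}),
])
print(payload)
```

The idea would be to batch a few hundred digested docs per request instead of one index call per document.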
Also, I was seeing a bunch of these types of errors:
2013-02-18 12:08:43,443 WARN
org.elasticsearch.cluster.action.shard - [server2_ec2-204-236-163-41]
sending failed shard for [sakai_index],
node[dmWqaCe0S4adiEAD_043qA], [R], s[INITIALIZING], reason [Failed to
start shard, message [RecoveryFailedException[[sakai_index]:
Recovery failed from
nested: RemoteTransportException[Failed to deserialize exception
response from stream]; nested: TransportSerializationException[Failed
to deserialize exception response from stream]; nested:
InvalidClassException[failed to read class descriptor]; nested:
2013-02-28 14:17:43,742 WARN
I/O worker #4} org.elasticsearch.transport.netty -
[server4_ec2-50-18-148-126] Message not fully read (response) for
error [true], resetting
When I google that, it suggests a version mismatch. I've
double-checked and that's not the problem. I saw one issue in 0.20.5
that looked like it might be related, but after upgrading I'm still
seeing the errors.
I was doing a bunch of refresh and flush calls during my indexing;
from the research I've done I gather it's best to just let ES do that
on its own. So I removed those and set these index properties:
"translog.flush_threshold_period" : "5s", "refresh_interval" : "5s"
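In case it matters, this is roughly the settings body I'm applying (a sketch of what I'd PUT to the index update-settings endpoint; the index name is from my setup):

```python
import json

# Settings body for PUT /sakai_index/_settings -- letting ES handle
# refresh and translog flushing on its own schedule instead of my
# code calling refresh/flush manually during indexing.
settings = {
    "index": {
        "translog.flush_threshold_period": "5s",
        "refresh_interval": "5s",
    }
}

body = json.dumps(settings, indent=2)
print(body)
```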
Those problems went away for a little while but now are back again.
Would manual refreshes cause that? I'm wondering if I was simply
triggering so many merges that things were essentially stepping on
each other. Any ideas what might cause this?
You received this message because you are subscribed to the Google Groups "elasticsearch" group.