Jörg,
I completely agree with your indexing advice. To summarize for Rob:
- You’re doing it pretty much how everyone does it.
- If you’re taking down your cluster, you need to slow down the rate at
which you index. Use a configurable sleep, or fewer Hadoop machines.
- If you want to index documents faster, the best thing you can do is run
more shards on more Elasticsearch nodes (one shard per node is optimal). It
might also help to use better hardware (like SSDs), but I haven’t profiled
that.
- The Elasticsearch defaults for indexing are pretty good, but you might
be able to tweak them to get tens of percent improvements. See the links in
my original post.
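For what it's worth, here is a rough Python sketch of the "configurable sleep" idea. It is only an illustration: index_throttled, send_batch, batch_size and sleep_seconds are names I made up, and send_batch stands in for whatever client call you actually use to ship a batch to Elasticsearch.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly short, batch.
        yield batch


def index_throttled(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: your own indexing call
    batch_size: int = 500,
    sleep_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send documents in batches, pausing after every batch.

    Returns the number of documents handed to `send_batch`.
    """
    sent = 0
    for batch in batches(docs, batch_size):
        send_batch(batch)
        sent += len(batch)
        if sleep_seconds > 0:
            # The pause is the throttle: raise it if the cluster struggles.
            sleep(sleep_seconds)
    return sent
```

In a Hadoop job each task would run its own loop like this, so the effective rate is roughly the per-task rate times the number of concurrent tasks. That is why running fewer machines has the same effect as a longer sleep.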
I remain a little skeptical about the Thrift API, though. I’ve looked at it
several times, and it’s really more of an HTTP-over-Thrift API than a
first-class Thrift API. The Thrift struct for API requests has a binary blob body
(https://github.com/elasticsearch/elasticsearch-transport-thrift/blob/master/elasticsearch.thrift#L23).
I haven’t bothered to fully trace the code, and the usage isn’t thoroughly
documented, but I presume that to use the Thrift API you form the JSON for
an API request, serialize it to UTF-8, and then put it in the body field
of a Thrift RestRequest. Please correct me if I’m wrong about that.
Parsing a Thrift API request might be somewhat less work for Elasticsearch
than parsing an HTTP API request, but parsing the body contents of a
Thrift API request is going to be the same as parsing the body contents of an
HTTP request. I haven’t profiled this, but I’d be surprised if
Elasticsearch was spending a ton of time parsing HTTP overhead.
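To make the point about the body concrete, here is a rough Python sketch of what I presume the two requests look like. Treat all of it as assumption: RestRequest below is a stand-in dataclass I wrote to loosely mirror the struct in elasticsearch.thrift, not the generated Thrift class, and the helper names and the /index/type/id URI are only for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class RestRequest:
    """Hypothetical stand-in for the Thrift RestRequest struct (not generated code)."""
    method: str
    uri: str
    parameters: Dict[str, str] = field(default_factory=dict)
    headers: Dict[str, str] = field(default_factory=dict)
    body: Optional[bytes] = None  # the "binary blob": UTF-8 encoded JSON


def encode_body(document: dict) -> bytes:
    """Form the JSON for the request and serialize it to UTF-8 bytes."""
    return json.dumps(document, ensure_ascii=False).encode("utf-8")


def build_thrift_index_request(index: str, doc_type: str, doc_id: str,
                               document: dict) -> RestRequest:
    """Wrap the JSON body in the (assumed) Thrift request envelope."""
    return RestRequest(
        method="PUT",
        uri=f"/{index}/{doc_type}/{doc_id}",
        body=encode_body(document),
    )


def build_http_index_request(index: str, doc_type: str, doc_id: str,
                             document: dict) -> Tuple[str, str, Dict[str, str], bytes]:
    """Describe the equivalent HTTP request as (method, path, headers, body)."""
    body = encode_body(document)
    headers = {"Content-Length": str(len(body))}
    return ("PUT", f"/{index}/{doc_type}/{doc_id}", headers, body)
```

If that is right, the body bytes are identical in both cases and only the envelope differs, so the JSON parsing cost on the Elasticsearch side should not change with the transport.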
Regardless, thanks for your advice on indexing. I really appreciate it.
-Jon