Hi @eugene_miretsky ,
> Rally is used as the nightly benchmarking tool for Elasticsearch - does that mean that one can benchmark and tune the cluster using the REST API, and expect the performance to mostly translate to the Bulk API?
I am the main author of Rally and there are multiple reasons why we use the Python REST API. First, HTTP is the only protocol supported by clients in all programming languages (the exception being the Java client, which only offered the transport protocol before ES 5.0), so the benchmark results are applicable to more scenarios. Second, Rally itself is implemented in Python, so using the Python REST client is the obvious and pragmatic choice.
Also, the benchmark should stress Elasticsearch, not the benchmark driver (i.e. Rally). Sure, data is sent via HTTP instead of the native transport, but serialization overhead should not dominate throughput; the actual processing of the bulk request should.
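To make the serialization side concrete, here is a minimal sketch (not Rally's actual code; the index name and documents are invented) of what the client does when it builds a `_bulk` request body: each document becomes one action line plus one source line of NDJSON. This is cheap string work compared to what the server does with the same request.

```python
import json

def build_bulk_body(index, docs):
    """Serialize documents into the NDJSON body of a _bulk request:
    one action line followed by one source line per document,
    terminated by a trailing newline as the Bulk API requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body("geonames", [{"name": "Vienna"}, {"name": "Graz"}])
```

The resulting string is what gets POSTed to `/_bulk` with `Content-Type: application/x-ndjson`; the transport client would send an equivalent payload in its binary format instead.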
Lastly, as David has mentioned, we'll have a Java HTTP client in Elasticsearch 5.0. Having said all that, I've never benchmarked the real difference and I am also not aware of an in-depth article that explains the differences in performance.
> Does anybody have any good tools for testing the Java API?
Most performance testing tools and scripts I am aware of use the HTTP API. You could probably try the Yahoo! Cloud Serving Benchmark, which uses the Elasticsearch transport client. But performance depends on so many factors (message length, bulk size, ES configuration, hardware specs, compression level for HTTP, number of concurrent operations, index size, JVM parameters, and so on) that the transport protocol is just one of them.
> Also, mind pointing us to the source of the nightly benchmarks? They don't seem to be in the Rally tracks git.
You are looking at the results page of the classic benchmarks (which are not available as open source). The new benchmarks are located at https://elasticsearch-benchmarks.elastic.co/app/kibana#/dashboard/Nightly-Benchmark-Overview
The results you see there are produced with the latest master version of Elasticsearch and the latest version of Rally, using the geonames track. In the near future we'll move to a different benchmarking environment and will run more tracks and publish more results.
You'll also note differences in the reported latency and throughput numbers. That's due to different measurement approaches. To give you just one example: The classic benchmark scripts use just one data point to determine indexing throughput whereas Rally considers a warmup period and then continuously samples indexing throughput.
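As a simplified illustration of the sampling approach (this is not Rally's actual measurement code; the sample values and warmup length are invented), consider periodic throughput samples where everything inside the warmup window is discarded and the remaining per-interval rates are summarized:

```python
import statistics

def steady_state_throughput(samples, warmup_s):
    """samples: (elapsed_s_since_start, docs_in_interval, interval_s) tuples
    taken periodically during the benchmark. Samples falling inside the
    warmup window are discarded; the rest yield per-interval docs/s,
    summarized here as a median."""
    rates = [docs / interval
             for elapsed, docs, interval in samples
             if elapsed > warmup_s]
    return statistics.median(rates)

# Throughput ramps up early on; a single end-to-end data point (the
# classic approach) averages the ramp-up in, sampling after warmup does not.
samples = [(10, 5000, 10), (20, 9000, 10), (30, 10000, 10), (40, 10200, 10)]
print(steady_state_throughput(samples, warmup_s=15))  # → 1000.0
```

Including the first sample would drag the figure well below the steady-state rate, which is one reason the two result pages are not directly comparable.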
I hope that answers your questions and if you need help with anything related to Rally just post a question here in the Elasticsearch forum.