Definitive guide for tuning using Marvel?

Is there a definitive guide available for tuning Elasticsearch using Marvel?

I have a 6-node cluster with 5 billion docs running over 2 VMs on VMware's vCloud environment.

A simple aggregation is taking ~40 seconds. How can I use Marvel to investigate this latency in a formal and methodical fashion?

Thanks,

There is no guide -- at least not yet -- but options vary based on the version of Elasticsearch that you're running.

For example, ES 2.3 (not Marvel) supports a new query profiler. This could be immensely useful, but it does not yet profile aggregation performance. Even so, it will help to see the performance of the raw query that you're performing, which is a prerequisite to the overall aggregation.
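As a rough sketch of what that looks like, profiling is switched on by adding "profile": true at the top level of the search body. The helper name, the "status" field, and the term query below are placeholders I made up for illustration; substitute the query part of your own slow request.

```python
import copy
import json


def with_profiling(search_body):
    """Return a copy of a search request body with profiling switched on."""
    profiled = copy.deepcopy(search_body)  # leave the caller's dict untouched
    profiled["profile"] = True             # top-level flag read by the profiler
    return profiled


# Hypothetical query: "status" is a placeholder field name, not from the original post.
query_only = {"query": {"term": {"status": "active"}}}

# This is the JSON you would send as the body of a _search request.
print(json.dumps(with_profiling(query_only), indent=2))
```

The response should then carry a "profile" section with per-shard timings, which you can compare against the total time of the full aggregation request.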

In terms of using Marvel to help to tune ES, it's more of an operational tool for tuning overall performance, rather than individual requests. Having said that, you can use a fresh instance that only runs the individual request as a way to tune based on that request.

The things that you need to be aware of when tuning any request really have nothing to do with Marvel, and everything to do with some general guidelines:

  • The size parameter of every level of the request has a direct impact on the amount of data that needs to be passed around (both externally, which is the obvious part, but also internally within the cluster).
    • This is a big problem that I see users frequently run into by requesting scary sizes.
  • The amount of work done by the query is a baseline requirement for getting the overall request to run.
    • If you are not using filters when possible, then you should rework the request to use them because filters can be cached for repeated requests.
  • Any sorting that you may be doing.
  • The amount of work done by the aggs (aka aggregations) at each step.
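To make those guidelines concrete, here is a minimal sketch of a request body that follows them, assuming ES 2.x syntax where the bool query accepts a filter clause. The function name, the "@timestamp" field, the "category" field, and the date range are all assumptions for illustration only.

```python
import json


def build_agg_request(field, start, end, top_n=10):
    """Build an aggregation-only search body that keeps the sizes small."""
    return {
        # size 0: return no hits, only the aggregation results.
        "size": 0,
        "query": {
            "bool": {
                # Filter context: no scoring, and the result can be cached.
                # "@timestamp" is a hypothetical date field.
                "filter": [
                    {"range": {"@timestamp": {"gte": start, "lt": end}}}
                ]
            }
        },
        "aggs": {
            # A modest bucket count limits what each shard sends back.
            "top_values": {"terms": {"field": field, "size": top_n}}
        },
    }


body = build_agg_request("category", "2016-01-01", "2016-02-01")
print(json.dumps(body, indent=2))
```

Comparing the timing of a request like this against your original one is a quick way to see how much of the 40 seconds comes from the sizes and how much from the query itself.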

Which of those would Marvel help with? Really only the last two because of the repeated sub-bullet. But it also shows search latency, which should highlight when this is becoming a problem.

I also noticed that you mentioned 6 nodes running on 2 VMs. Why are there 3 nodes per VM? You could look at Marvel to see what each node is doing in terms of its performance metrics while this type of slow request is being handled: does JVM heap utilization go up significantly? Does the CPU go crazy?
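If you want to spot-check the same numbers outside of Marvel while the slow request runs, something along these lines should work against the node stats API. This is only a sketch: the base URL is an assumption, and the field paths (jvm.mem.heap_used_percent and process.cpu.percent) are what I would expect in the response, so verify them against your version.

```python
import json
from urllib.request import urlopen


def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch JVM and process stats for every node (base_url is an assumption)."""
    with urlopen(base_url + "/_nodes/stats/jvm,process") as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(stats):
    """Reduce a node stats response to one row per node, sorted by node name."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        rows.append({
            "name": node.get("name", node_id),
            "host": node.get("host"),  # shows which nodes share a VM
            "heap_used_percent": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
            "process_cpu_percent": node.get("process", {}).get("cpu", {}).get("percent"),
        })
    return sorted(rows, key=lambda row: row["name"])


if __name__ == "__main__":
    # Run this repeatedly while the slow aggregation is in flight.
    for row in summarize(fetch_node_stats()):
        print(row)
```

Grouping the rows by host makes it easier to see whether the three nodes on one VM are competing for the same CPU and memory.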

Hope that helps,
Chris