Paramedic - what do the values in the index shard details mean?

We're using Paramedic (https://github.com/karmi/elasticsearch-paramedic)
and finding it helpful to monitor our ElasticSearch cluster.

At the bottom of the display it shows a summary line for each index
with a Show Details button. Clicking that reveals a series of boxes
for each shard (columns) on each node (rows).

In each box there's a status heading and two figures: a time and a size.
There's no indication what the figures relate to, other than relating to
a particular shard on a particular node, obviously.

I'm especially interested because we're having some performance issues
and these numbers may shed light on it. For reference we have 3 nodes and
our indices are about 5GB and get rolled over daily.

Most of the figures are very low, say 20ms and 58b, but the shards for
one row (node) show very high figures, say 52s and 1012mb. (I also once
saw high values in just one column, i.e., one shard across several nodes.)

It's possible it may be due to the particular way we're loading and
querying the data, but I can't be sure without knowing more about the
numbers.

I'd be grateful if someone could shed some light on them.

Tim.

--

They're related to recovery, which is ES's term for initializing
shards and making them ready for use. I believe the numbers are only
updated when the shard is first recovered. Mine do not seem to
update for ongoing replication.

The primary shards (blue) will probably never have large numbers
because they don't need to recover data (unless a replica has been
promoted to a primary). And you may see small numbers on replica shards
if they were created at index time.

From my usage, the numbers seem to be large when a replica was
created from a large-ish primary shard and had to be recovered with a
non-trivial amount of data. In your case, that 52s/1G was likely a
shard initialized on a remote node, which works out to roughly
19.69 MB/s (1024 MB in 52 s). That may or may not be acceptable
depending on your network, but it's likely normal operation.
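
The arithmetic behind that rate can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes "1G" means 1024 MB, which is what yields the 19.69 figure. Using the 1012mb actually shown in the display gives a slightly lower rate.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate for a recovery that moved size_mb in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


# "1G" taken as 1024 MB (assumption) versus the 1012mb shown in Paramedic.
rate_rounded = round(throughput_mb_per_s(1024, 52), 2)
rate_displayed = round(throughput_mb_per_s(1012, 52), 2)
print(rate_rounded, rate_displayed)  # 19.69 19.46
```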

-Drew

--

To add to Drew's earlier explanation:

  • Paramedic is just an interface to the "Index Status API". To see the
    raw data, go to: http://localhost:9200/_status?recovery=true
  • The first value is indeed the time spent recovering the shard, and the
    second value is the size of the index in MB/GB (see the raw data).
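
For anyone who prefers a script to the browser, the raw data can be pulled and flattened roughly as follows. Treat this as a sketch under assumptions: the URL is the one given above, but the JSON key names (indices, shards, routing, peer_recovery, gateway_recovery, time_in_millis, size_in_bytes) are a guess at the response layout and should be checked against what your cluster actually returns. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url="http://localhost:9200"):
    """Fetch the raw Index Status response, including recovery details."""
    with urlopen(base_url + "/_status?recovery=true") as response:
        return json.load(response)


def recovery_rows(status):
    """Flatten a status response into one row per shard copy.

    ASSUMPTION: every key name used here is a guess at the response
    layout; adjust them to match your own raw data.
    """
    rows = []
    for index_name, index_data in status.get("indices", {}).items():
        for shard_id, copies in index_data.get("shards", {}).items():
            for copy in copies:
                routing = copy.get("routing", {})
                # A copy is assumed to report either peer or gateway recovery.
                recovery = (copy.get("peer_recovery")
                            or copy.get("gateway_recovery")
                            or {})
                rows.append({
                    "index": index_name,
                    "shard": shard_id,
                    "node": routing.get("node"),
                    "primary": routing.get("primary"),
                    "time_ms": recovery.get("time_in_millis"),
                    "size_bytes": recovery.get("index", {}).get("size_in_bytes"),
                })
    return rows


# Usage against a live cluster (not run here):
#   for row in recovery_rows(fetch_status()):
#       print(row)
```

Missing keys come back as None rather than raising, so a layout mismatch shows up as empty columns instead of a traceback.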

These numbers indeed stay the same once the shard has been successfully
loaded/recovered. If one of the shards of the same index is too big
compared to the other shards (of the same index), you may be using the
routing feature in a sub-optimal way, creating a "hot shard" with too
much data.
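
One rough way to spot such a "hot shard" is to compare each shard's size with the median for the index. The sketch below is illustrative only: the function name, the factor of 2 and the sample sizes are all made up, and the right threshold depends on your data.

```python
from statistics import median


def find_hot_shards(shard_sizes, factor=2.0):
    """Return shard ids whose size exceeds factor times the median size."""
    if not shard_sizes:
        return []
    typical = median(shard_sizes.values())
    return sorted(s for s, size in shard_sizes.items() if size > factor * typical)


# Made-up sizes in bytes for the five shards of one index.
example_sizes = {
    "0": 210_000_000,
    "1": 198_000_000,
    "2": 1_061_000_000,
    "3": 205_000_000,
    "4": 201_000_000,
}
print(find_hot_shards(example_sizes))  # ['2']
```

The median is used rather than the mean so that one oversized shard does not drag the baseline up and hide itself.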

Karel

On Wednesday, January 9, 2013 9:44:23 PM UTC+1, Drew Raines wrote:

[...]

--

On Thu, Jan 10, 2013 at 06:45:50AM -0800, Karel Minařík wrote:

[...]

Thank you both for the detailed replies.

Tim.

--