A few scaling questions

All:

I have a few scaling questions that I wanted to get input on. First, a
little background: our company currently runs ES at an older version,
0.17.7. Due to scheduling concerns, taking down the index to do major
version upgrades is very difficult for us, and we have usage from an older
code base that we will be shutting down next month before we upgrade to
0.19.x. We currently have 6 indices, varying in size from a single small
document that will likely never grow, to an index with around 700k documents
that grows by a few thousand a day, all the way up to 500 million documents;
this largest index grows by around 750k documents a day. All of the
documents in question are rather small. We run our indices across 6 nodes
with 12 shards per index and 1 replica to allow for growth and redundancy.
I have no idea if this is suboptimal, whether we should be using a different
number of shards for the smaller indices, more shards for the larger one, etc.

We are having periodic issues with our current setup, some of which I am
sure are addressed in the versions we have yet to upgrade to. One that hit
me yesterday: our cluster ended up with 3 nodes each thinking they were the
master of the same cluster. The only way I was able to resolve this was to
restart the offending nodes, and ultimately any other nodes that agreed
that the false masters were the master; this led to some data loss, as
I was not able to keep all the shards/replicas for a given index up before
restarting the offending nodes. While we still have the source data to
recreate the documents, the process to do so requires walking a filesystem
and doing shallow parses on millions of files; it will take several days to
weeks to figure out what data is missing and replace it. I don't know if I
hit a bug or if I don't have things configured correctly. I am considering
storing the documents as raw JSON in Postgres as well, just to speed up
recovery in case the cluster loses data again. All of our data nodes are
configured the same, and I will include the YAML file at the end of this post.

Additionally, during times when the largest index is merging segments and
times when a node fails and shards need to be reallocated, our Java
clients get timeouts when trying to post data to the cluster. Once all
shards have been reallocated or the index finishes being optimized, these
issues clear up. I can't imagine this is intended behavior, so I have clearly
done something wrong; I just don't know what.

Finally, what is the best way to back up ES and, for that matter, to safely
restore from backup, either after a disaster or to another cluster for
testing purposes? Our data is spread out and we can't really afford to shut
it down; our largest index is over 500 GB, so any rsync strategy will take
a long time and will have to pull from several boxes depending on where the
shards are.

If anyone could provide some best practice pointers for a setup like mine I
would greatly appreciate it. Please refrain from just saying "upgrade to
0.19.x" as that is already in the works, I'm more looking to make sure I'm
not doing something inadvertently very stupid.

Thanks in advance for any suggestions,

--Dan

-- yaml config follows
cluster:
  name: grid
path:
  logs: /var/log/es
  data: /var/grid/index
bootstrap:
  mlockall: true
index:
  term_index_divisor: 4
  number_of_shards: 12
  number_of_replicas: 1
  merge:
    scheduler:
      max_thread_count: 4
discovery:
  zen:
    ping_timeout: 30s
jmx:
  create_connector: true
transport:
  netty:
    worker_count: 64

On 13 June 2012 10:21, dostrow dostrow@dicomgrid.com wrote:

While we still have the source data to recreate the documents, the process
to do so requires walking a filesystem and doing shallow parses on millions
of files; it will take several days to weeks to figure out what data is
missing and replace it. I don't know if I hit a bug or if I don't have
things configured correctly. I am considering storing the documents as raw
JSON in Postgres as well, just to speed up recovery in case the cluster
loses data again. All of our data nodes are configured the same, and I will
include the YAML file at the end of this post.

This might not help you right now, but if your source data has an ID and
version concept, this is exactly why we wrote Scrutineer:
github.com/Aconex/scrutineer . It can tell you very quickly which items in
your ES index do not match your source data.

It sounds like your source information is from Postgres? If your source
tables have an ID (presumably yes) and you can work out a 'version' concept
(either a lastupdated timestamp or a common Pessimistic Locking version
flag), then use ES's External Version type and store that in ES.
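
Roughly, that looks like the following (just a sketch: the index/type names,
document ID and version value here are made up, and this assumes external
versioning is available in the ES version you end up on):

curl -XPUT 'http://localhost:9200/myindex/mytype/1234?version=1339539600000&version_type=external' -d '{
  "some_field" : "some value"
}'

ES will then only accept a write whose supplied version is higher than the
one it already holds, which is what lets a version-based comparison against
the source work.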

Scrutineer can detect any inconsistencies very quickly: on just a simple
2-node ES cluster with pretty ordinary IO (RAID-1, don't ask...),
verifying 38 million items takes only 5 minutes.

This can also be advantageous when you want to switch to a new cluster and
minimise downtime:

  • Load up a gateway snapshot backup on the new cluster
  • At switch time, run Scrutineer against the new ES cluster, pointing the
    source at your production DB, allowing you to feed the Scrutineer output
    into a tool to 'freshen' your side-by-side cluster
  • Switch the cluster

No need to reindex everything again, just what was missing or out of date.

Worth considering anyway.

cheers,

Paul Smith

Questions/comments inline:

I have a few scaling questions that I wanted to get input on. First, a
little background: our company currently runs ES at an older version,
0.17.7. Due to scheduling concerns, taking down the index to do major
version upgrades is very difficult for us, and we have usage from an older
code base that we will be shutting down next month before we upgrade to
0.19.x. We currently have 6 indices, varying in size from a single small
document that will likely never grow, to an index with around 700k documents
that grows by a few thousand a day, all the way up to 500 million documents;
this largest index grows by around 750k documents a day. All of the
documents in question are rather small. We run our indices across 6 nodes
with 12 shards per index and 1 replica to allow for growth and redundancy.
I have no idea if this is suboptimal, whether we should be using a different
number of shards for the smaller indices, more shards for the larger one, etc.

I'm not sure what your search pattern is, but have you considered splitting
that data into 'monthly' or even 'daily' indices? You could then use
aliases to group data, and perhaps even routing if your data format allows
for it easily. This should be something you can do without actually
modifying the view from the client side, and it is reasonably easy to
implement on a live cluster (simply add your new index, then create an
alias that points to both the old index and the new one); there is a sketch
of this just below. If your data is only stored in one place, using the
gateway and using more than 1 replica is certainly advisable; the
replication factor of important data should really be 2 IMHO. That allows
for scaling reads and rebuilding complete nodes without always having to
worry about being down to one replica and risking your data in that
situation.
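
A rough sketch of what that looks like (the index and alias names here are
made up, and the shard/replica counts are only illustrative):

# create this month's index
curl -XPUT 'http://localhost:9200/docs-2012-06' -d '{
  "settings" : { "index" : { "number_of_shards" : 6, "number_of_replicas" : 2 } }
}'

# point a single 'docs' alias at both the previous and the new index
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    { "add" : { "index" : "docs-2012-05", "alias" : "docs" } },
    { "add" : { "index" : "docs-2012-06", "alias" : "docs" } }
  ]
}'

Clients keep searching against 'docs' and never notice a new index being
added behind the alias.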

We are having periodic issues with our current setup, some of which I am
sure are addressed in the versions we have yet to upgrade to. One that hit
me yesterday: our cluster ended up with 3 nodes each thinking they were the
master of the same cluster. The only way I was able to resolve this was to
restart the offending nodes, and ultimately any other nodes that agreed
that the false masters were the master; this led to some data loss, as
I was not able to keep all the shards/replicas for a given index up before
restarting the offending nodes. While we still have the source data to
recreate the documents, the process to do so requires walking a filesystem
and doing shallow parses on millions of files; it will take several days to
weeks to figure out what data is missing and replace it. I don't know if I
hit a bug or if I don't have things configured correctly. I am considering
storing the documents as raw JSON in Postgres as well, just to speed up
recovery in case the cluster loses data again. All of our data nodes are
configured the same, and I will include the YAML file at the end of this post.

Personally, I haven't seen this come up, outside of perhaps the client
side misbehaving; however, without further information from the logs, and
perhaps other system-level information from the time, nothing stands out
to me. One of the other folks on the list may be able to respond with info
on a bug that they have seen. Have you considered moving to a different
method of discovery? Although I can't for the life of me think ES is
misbehaving on the discovery side, unless there's something funky at the
network layer.

Additionally, during times when the largest index is merging segments and
times when a node fails and shards need to be reallocated, our Java
clients get timeouts when trying to post data to the cluster. Once all
shards have been reallocated or the index finishes being optimized, these
issues clear up. I can't imagine this is intended behavior, so I have clearly
done something wrong; I just don't know what.

Again, this seems perhaps to be a discovery problem; I'll leave that to
other folks on the list, but I can certainly see some commonality between
this and the master voting problem you were describing earlier.

Finally, what is the best way to back up ES and, for that matter, to safely
restore from backup, either after a disaster or to another cluster for
testing purposes? Our data is spread out and we can't really afford to shut
it down; our largest index is over 500 GB, so any rsync strategy will take
a long time and will have to pull from several boxes depending on where the
shards are.

Have you considered pointing the gateway at a device that offloads backup?
Perhaps a local disk, or even NFS to another box?
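
For what it's worth, a sketch of what that might look like with the
shared-filesystem gateway (the mount path is made up; double-check the
gateway settings against the docs for your version):

gateway:
  type: fs
  fs:
    location: /mnt/es-backup

Every node needs to see the same mount (NFS or similar), and the gateway
then keeps an off-node copy of the index that a fresh cluster can recover
from.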

If anyone could provide some best practice pointers for a setup like mine
I would greatly appreciate it. Please refrain from just saying "upgrade to
0.19.x" as that is already in the works, I'm more looking to make sure I'm
not doing something inadvertently very stupid.

The biggest reason you'll get 'upgrade to 0.19.x' is that you're going
to hit bugs, and it's sometimes hard to tell whether it's a bug or an issue
with the deployment/installation without huge amounts of information. That
said, sometimes it's pretty easy to see (i.e. there'll be an error, or a
very well-known situation). Getting your scaling factors right with ES is
more of an art than a science, and it takes knowing your workload, your
builds, your infrastructure, and your clients. Shay puts a lot of work into
performance improvements in each release, and from what I recall of the
release notes, you should see a pretty significant jump when you upgrade!

I don't see anything about memory or the run configuration, or anything
about disk configuration or system stats; any chance you could send those
along too?

Adding to the answers:

  1. The master election problem is much better in 0.19, with the ability to
    configure discovery.zen.minimum_master_nodes (in your case, with 6 nodes,
    I would configure it to 3; see the config sketch at the end of this
    message).
  2. Recovery is considerably better in 0.19, requiring fewer resources.
    Merges still take their toll, but they are a bit more optimized in 0.19.
  3. The simplest way to do a backup is to disable translog flush and copy
    over the data location of each node (also sketched below). We do plan to
    have a proper backup/restore API in the future.

I know, these points all suggest the 0.19 upgrade, but those improvements
over previous versions are important...
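
A rough sketch of what points 1 and 3 look like in practice (the index name
is made up, and the exact setting names are worth double-checking against
the 0.19 docs):

# elasticsearch.yml on each node -- require a majority of the 6 nodes to elect a master
discovery:
  zen:
    minimum_master_nodes: 3

# before copying each node's data directory, stop translog flushing
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index" : { "translog.disable_flush" : true }
}'

# ... copy /var/grid/index from each node to the backup target ...

# then re-enable flushing
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index" : { "translog.disable_flush" : false }
}'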
