Rip apart my cluster architecture, would you? ;)

So we've been using ES for a while now, and I have an architecture I've set
up that I'm absolutely not 100% sure is right. I'd like to lay it out and
see if anyone can tell me where I might be going wrong.

We have, as our data set, roughly 10 million documents. Each one represents
a product and then a bunch of data on that product suitable for queries.
Our queries are pretty good (because someone else writes them :-)) and we
get the results we want.

We have five nodes. Three are in one data center (call it data center M)
and two are in another (call it data center B). There is a nice, fat pipe
between the two so communication is acceptable.

I replicate every shard on every node. We have plenty of disk space, the
data set isn't so huge that it fills up memory, and I really do want to
optimize for reads. The reason for that is that we re-load our index once
per day in the middle of the night.

To do this, I create a new index, load all the data, and then move an index
alias from the old to the new. No downtime. I wrote a job that loads the
data via the bulk API. I'm pretty happy with this, too.

In the M data center, machine M1 is the one I use to load the data. It is
NOT in our load balancing rotation for reads. Machines M2 and M3 are, as
are both machines in data center B.

All M machines are master=true data=true. All B machines are master=false
data=true. The reason I made B machines master=false was so that while
building the new index nightly on M1, it doesn't have to go to a B machine
as the master. I presume this is wise. I'm not sure.

I write in batches of 2000 documents and get a write speed of about 1300
documents per second.
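
As a rough sketch of what those numbers imply: the snippet below splits a document stream into batches of 2000, builds a newline-delimited bulk body for each batch, and estimates the full load time at 1300 documents per second. The index name and document fields are made up, and on 1.x-era clusters the action line (or the URL) would also need a `_type`.

```python
import json
from itertools import islice

def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def bulk_body(index, chunk):
    """Build a newline-delimited bulk body: one action line, then one source line."""
    lines = []
    for doc in chunk:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline

# Hypothetical documents, for illustration only.
docs = ({"id": i, "name": "product %d" % i} for i in range(5000))
sizes = [len(chunk) for chunk in batches(docs, 2000)]

# 10 million documents at 1300 documents per second.
est_hours = 10_000_000 / 1300 / 3600
print(sizes, round(est_hours, 1))
```

At that rate a full reload takes a little over two hours, so any gain in bulk throughput shortens the nightly window directly.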

I also have ONE job that does scripted upserts in batches of 1000 each that
gets about 300 documents per second. This is slower than I'd like. I'm
unsure how I might speed this up.
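
For context, a scripted upsert inside a bulk request is a pair of lines: an `update` action line, then a payload carrying the script and the document to insert if none exists yet. The sketch below uses the flat 1.x-era payload layout (`script`, `params`, `upsert` as sibling keys); newer versions nest the script under an object instead, so treat the exact shape as an assumption to check against your version. The index name, id, script, and fields are all hypothetical.

```python
import json

def upsert_lines(index, doc_id, script, params, upsert_doc):
    """Build the two bulk lines for one scripted upsert (1.x-style payload)."""
    action = {"update": {"_index": index, "_id": doc_id}}
    payload = {"script": script, "params": params, "upsert": upsert_doc}
    return json.dumps(action) + "\n" + json.dumps(payload) + "\n"

# Hypothetical script and fields, for illustration only.
pair = upsert_lines(
    "products", "sku-1",
    "ctx._source.stock += delta", {"delta": 5},
    {"stock": 5},
)
print(pair, end="")
```

Each scripted update has to fetch the current document, run the script, and reindex the result, which is one plausible reason this job runs slower than the plain bulk load.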

So... anything stand out as bad?

Could I maybe speed up writes by turning replication off while writing and
then back on when done, so that my cluster isn't updating every node during
the writes? Since I keep the index alias pointed at the previous index
until the new one is ready, this should be okay, right?

Anything I might be missing?

THANK YOU TONS if you can chime in. ES is wonderful, but as we all know,
there's a lot to learn!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9348104d-efa7-42ae-baac-f1c63d849e6c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Cross DC clusters are not recommended as ES is very latency sensitive and
you may find that some of the write delay is because data has to traverse
this link to the other nodes.
It'd be better to run a separate cluster in each data center and index to
each one separately, or use snapshot and restore to copy the index over.

Turning off replicas while reindexing is definitely a good idea; you can
also turn off refresh (index.refresh_interval: -1) until the data has been
loaded, then re-enable it, then turn replicas back on, then switch your
alias.
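
A minimal sketch of that sequence, written as an ordered list of (method, path, body) requests rather than live calls. It reads the `refresh: -1` suggestion as the index setting `index.refresh_interval`, assumes `1s` (the default) as the value to restore, and uses 4 replicas because the original post keeps every shard on all five nodes. The function name, index names, and alias are hypothetical.

```python
def reindex_plan(new_index, old_index, alias, replicas):
    """Return the nightly reload as ordered (method, path, body) steps."""
    settings = "/%s/_settings" % new_index
    return [
        # 1. No replicas and no refresh while loading.
        ("PUT", settings, {"index": {"number_of_replicas": 0,
                                     "refresh_interval": "-1"}}),
        # 2. Bulk load (placeholder; the real batches go here).
        ("POST", "/_bulk", None),
        # 3. Re-enable refresh (assumed default of 1s).
        ("PUT", settings, {"index": {"refresh_interval": "1s"}}),
        # 4. Turn replicas back on.
        ("PUT", settings, {"index": {"number_of_replicas": replicas}}),
        # 5. Switch the alias from the old index to the new one.
        ("POST", "/_aliases", {"actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]}),
    ]

# Five nodes, every shard on every node: 1 primary copy + 4 replicas.
plan = reindex_plan("products_new", "products_old", "products", replicas=4)
for method, path, _ in plan:
    print(method, path)
```

In practice you would probably also wait for the new index to finish allocating its replicas before step 5, so reads are not served by a single copy.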

On 19 November 2014 03:25, Christopher Ambler const.dogberry@gmail.com
wrote:

[quoted text of the original post trimmed]

Just out of curiosity, why do you reindex each night?

Our data set changes constantly, but a refresh every 24 hours is sufficient
for our needs.

I could use a river or some kind of data loader to keep up to date, but
it's really not necessary if I just create a new index once a day.