Does anyone have experience doing 'load-balanced' writes (and reads)?

I am planning on load-balancing my reads and writes across a cluster with 5 shards and 2 replicas.

Here's an example of my Python code:

from pyes import ES
conn=ES(['primary.1:9200', 'replica.1:9200', 'replica.2:9200'])

Writes: conn.index(doc, index, doc_type, id=id, bulk=True)
Reads: conn.search(query, indexes=[index])
conn.get(...)

The reason for load-balancing writes/reads is that the primary box sees 80% CPU spikes quite often. We have pretty high read/write traffic. All boxes are 64 GB memory, 8-core EC2 instances. I was hoping a multi-server list in the ES() call would automagically distribute the reads and writes.
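Here's a slightly fuller sketch of what I'm hoping for (the index name, doc type, and field values below are made up for illustration; TermQuery is what I believe pyes uses for simple term lookups):

from pyes import ES
from pyes.query import TermQuery

# All three nodes go into the connection; the idea is that pyes picks a
# server from this list per request instead of pinning to one box.
conn = ES(['primary.1:9200', 'replica.1:9200', 'replica.2:9200'])

doc = {'user': 'kimchy', 'message': 'trying out load-balanced writes'}

# Writes: queued as bulk operations; pyes sends them once its bulk
# buffer fills up (or when the bulk queue is flushed explicitly).
conn.index(doc, 'myindex', 'mytype', id='1', bulk=True)

# Reads: a simple term query against the same index.
results = conn.search(TermQuery('user', 'kimchy'), indexes=['myindex'])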

Questions:

  1. I searched for previous posts on reads and came across this thread: http://elasticsearch-users.115913.n3.nabble.com/How-to-fix-primary-replica-inconsistency-td4022692.html#a4024176, which I just posted a question to. Is this still an issue? I'm running 0.19.2.
  2. I couldn't find any threads on writes - not the way I'm planning them, anyway. Please share your experiences and advice.
  3. Am I doing it right (from the above code snippet)?

Thanks.

Hey,


Just out of curiosity, what is wrong with 80% CPU? I usually get worried if they are not using enough CPU.

simon



--

Just out of curiosity, what is wrong with 80% CPU? I usually get worried if they are not using enough CPU.

Lol! That's true. When you buy or rent a computer, it's best to use it :wink:

--

Hello,

The Python code seems right to me; pyes should automatically take a random address out of the list for each request, as far as I've seen in its code.

If you have a higher load on one of the nodes, I don't think looking at primary shards vs. replicas is the way to go. It might just be a coincidence that the server with primary shards is the busiest, because there shouldn't be a significant difference. I've had a similar problem and was looking for a way to balance primary shards, until I realized it was a storage performance issue on the "hot" server, and that the server was actually holding replicas, not primary shards :slight_smile:

I would consider adding a node without data that would act as a load balancer in front of the cluster; a rough sketch of its config is below.
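On the 0.19.x line that just means starting one more node whose elasticsearch.yml disables data (and, typically, master eligibility); the cluster and node names here are only examples:

# elasticsearch.yml on the dedicated "entry point" node
cluster.name: mycluster    # must match the data nodes' cluster name
node.name: lb-node-1       # example name
node.data: false           # holds no shards itself
node.master: false         # not master-eligible, it only routes requests

You'd then point your clients at this node only, and it forwards each request to whichever data nodes hold the relevant shards.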

If that doesn't fix it, it's likely that the problem doesn't lie in the ES layer.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene



--

What can I say? I'm a nice guy :slight_smile:

The way I look at replicas is for failover too. If the primary box fails and its 80% load fails over to the remaining 2 replicas, those nodes would each be looking at ~120% spikes (three boxes' worth of load spread over two), which is not tenable and might take down the whole cluster in a short time, before ops can add more nodes. Comfort zone is ~55% IMO.

Radu, thanks for the gem! That should help.

And in your use-case, did you experience any inconsistencies in reads or writes (or both)?

We have an SLA to provide live data within a couple of seconds - that's one part of the consistency story. The other part is that we do a conn.get() call (or conn.exists()) to check for existence and lazy-create the doc if it's missing. So any inconsistency would trigger multiple creates/writes (which is fine from an integrity point of view, because each doc uses a unique _id), but we do not want to add extra indexing load to the system.
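For reference, the lazy-create path looks roughly like this; it's only a sketch, and I'm assuming pyes signals a missing doc with NotFoundException from pyes.exceptions (the exact exception may differ between pyes versions):

from pyes import ES
from pyes.exceptions import NotFoundException

conn = ES(['primary.1:9200', 'replica.1:9200', 'replica.2:9200'])

def get_or_create(index, doc_type, doc_id, default_doc):
    # Try to fetch the existing document first.
    try:
        return conn.get(index, doc_type, doc_id)
    except NotFoundException:
        # Lazy-create. Every doc has a unique _id, so a duplicate create
        # caused by a stale read just re-indexes the same document, but
        # it is extra indexing load we'd rather avoid.
        conn.index(default_doc, index, doc_type, id=doc_id)
        return default_doc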

Hello,

On Thu, Oct 18, 2012 at 8:55 PM, es_learner dave@livefyre.com wrote:

Radu, thanks for the gem! That should help.

You're welcome :slight_smile:

And in your use-case, did you experience any inconsistencies in reads or
writes (or both)?

No, I didn't experience any inconsistencies between primaries and
replicas. But I didn't use async replication while indexing. I suppose
turning that on might open a small window of inconsistency under load.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Hi All,

I'm a colleague of 'es_learner', working on the same problem. A couple of points, plus a couple more thoughts from our side.

First off, the recent jump in CPU usage came when we introduced some new features in our platform. We're looking for ways to scale ES out reliably. Currently, we have three indices spread across three servers, but we restrict traffic for each index to a specific server we label internally as responsible for that index. If one node goes down, we can cut traffic over to another machine, which then handles ~double the traffic until we restore the broken node. We want to spread our traffic across all three nodes reliably, so we can add more nodes as our needs scale out.

We also have evidence that the spikes themselves might be related to some settings we can tune better. And the dedicated no-data node sounds like a lovely thing for us to try out; in my book, it's always best when each server is doing one and only one thing.

Something else we're worried about is some of the issues brought up in an older thread:

http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html

It seems that as long as we ensure synchronous replication for all operations, we shouldn't hit any of the network-partition problems described there. The default settings for index and update should both ensure that at least a quorum of shard copies is visible during AWS flakiness.
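If we ever want to make that explicit per request rather than relying on the defaults, my understanding is that the 0.19-era index API accepts replication and consistency as query-string parameters. A rough sketch outside pyes (the index, type, and id are made up, and I'm using the requests library here):

import json
import requests

doc = {"field": "value"}

# Index one document, asking ES to replicate synchronously and to
# require a quorum of shard copies before acknowledging the write.
resp = requests.put(
    "http://primary.1:9200/myindex/mytype/1",
    params={"replication": "sync", "consistency": "quorum"},
    data=json.dumps(doc),
)
print(resp.text)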

Thanks for all the recommendations.

-Yaakov M Nemoy


--

Hey Radu,

What was your load balancer config, in terms of EC2 instances if that's what you use (e.g. XL = 15 GB memory and 4 cores)?

Our average traffic is about 1 write and 1 read every 20 milliseconds (roughly 50 writes/s plus 50 reads/s).

Thanks.

Hello,

I haven't used EC2 yet, and in my case the uneven load was caused by a storage problem which had to be fixed. So I didn't even need a load balancer, because the overhead that traffic put on the "entry point" was negligible. That said, we only had about 1 query per second, though we did have 1-2K writes per second.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene



--