Shards allocation in cluster on the same node

Bastien_Chong · February 14, 2014, 5:43pm

I have 2 ES nodes configured with 2 shards and 0 replicas. I'm testing how
fast logstash can push logs from a dummy "access_log" file to this cluster.

From my test, with m3.xlarge on EC2, I can push around 4500 logs/sec. But I
noticed that my 2 shards were on the same node. I still don't get how the ES
black magic works: why does it not split the shards? Would splitting them
allow me to push 9000/sec?

I can't tell whether logstash or ES is the bottleneck here.

/usr/bin/java -Xms7g -Xmx7g -Xss256k -Djava.awt.headless=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-0.90.10.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/var/log/elasticsearch
-Des.default.path.data=/var/lib/elasticsearch
-Des.default.path.work=/tmp/elasticsearch
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.ElasticSearch

/usr/bin/java -Xmx1G -Xms1G -cp
/usr/local/bin/logstash/logstash.jar:/usr/local/bin/logstash/cloud-aws/*
logstash.runner agent --config /etc/logstash/mylogstash.conf --log
/var/log/logstash/logstash.log

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0abf0a40-94c4-4bae-adb4-5ecebc1ebb2a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Binh_Ly · February 14, 2014, 5:53pm

Shards should distribute over the 2 nodes, assuming they are part of a
single cluster. Theoretically, yes, more shards distributed across multiple
nodes will increase indexing speed. But you can still be limited by other
resources such as network, CPU, or memory, so it's hard to say exactly what
your throughput will be.
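
One way to check whether shards are actually spread out is to count them per node. The following is a minimal sketch, assuming the shard-to-node assignments have already been copied out of the cluster state into a plain list. The `shards_per_node` and `is_balanced` helpers, the tuple layout, the index name and the node names are all made up for illustration and are not an Elasticsearch API:

```python
from collections import Counter

def shards_per_node(assignments):
    """Count shards per node.

    `assignments` is a hypothetical list of (index, shard_number, node_name)
    tuples, e.g. copied by hand from the cluster state output.
    """
    return Counter(node for _index, _shard, node in assignments)

def is_balanced(assignments, nodes):
    """True if no node holds more than one shard above any other node."""
    counts = shards_per_node(assignments)
    per_node = [counts.get(node, 0) for node in nodes]
    return max(per_node) - min(per_node) <= 1

# The situation from the first post: 2 shards, both on the same node.
before = [("logstash", 0, "node1"), ("logstash", 1, "node1")]
# The expected layout once the shards are split across the 2 nodes.
after = [("logstash", 0, "node1"), ("logstash", 1, "node2")]

print(is_balanced(before, ["node1", "node2"]))  # False
print(is_balanced(after, ["node1", "node2"]))   # True
```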


Bastien_Chong · February 14, 2014, 7:10pm

I managed to split the shards by restarting ES on the master, then
retested. Throughput is the same.

4500/sec seems a bit low; each doc is just 8k. Network doesn't seem to be
the bottleneck. I checked the IO on disk, and it's between 0 (probably when
it's buffering before flushing) and 50/70. Do you think I should get
Provisioned IO on my EC2 instance?
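
For a rough sense of scale, the figures above can be turned into a data rate. This is a back-of-the-envelope sketch, assuming "8k" means 8 KiB per document and ignoring protocol and JSON overhead (both are assumptions, not measurements):

```python
DOCS_PER_SEC = 4500        # indexing rate reported in the test
DOC_SIZE_BYTES = 8 * 1024  # assumption: "8k" means 8 KiB per document

def data_rate_mib_per_sec(docs_per_sec, doc_size_bytes):
    """Raw payload rate in MiB/s, ignoring any overhead."""
    return docs_per_sec * doc_size_bytes / (1024 * 1024)

rate = data_rate_mib_per_sec(DOCS_PER_SEC, DOC_SIZE_BYTES)
print(f"{rate:.1f} MiB/s")        # 35.2 MiB/s
print(f"{rate * 8:.0f} Mibit/s")  # 281 Mibit/s
```

That is the total payload entering the cluster; by itself it says nothing about where the bottleneck is.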


Bastien_Chong · February 14, 2014, 7:28pm

I provisioned a 300 IOPS disk; no improvement at all.

Logstash is running on the same instance as the master node.


Binh_Ly · February 14, 2014, 7:45pm

It's hard to diagnose things offline, but could you run another logstash
somewhere else (maybe on the second box), run both of them in parallel, and
see what your combined ES throughput is? Both would be writing to the same
single ES cluster.
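
The reasoning behind this test can be sketched as follows: if a second sender roughly doubles the combined rate, the single logstash was the limit; if the combined rate barely moves, the limit is more likely on the ES side or in a resource both senders share. The `interpret` helper, its 20% tolerance and the sample rates other than 4500 and 9000 are illustrative assumptions, not a rule from Elasticsearch or logstash:

```python
def interpret(single_rate, combined_rate, senders=2, tolerance=0.2):
    """Classify the result of running `senders` logstash instances in parallel.

    single_rate:   docs/sec measured with one sender
    combined_rate: total docs/sec measured with all senders running
    tolerance:     hypothetical slack allowed around the ideal values
    """
    ideal = single_rate * senders
    if combined_rate >= ideal * (1 - tolerance):
        return "sender-bound"      # throughput scaled with the senders
    if combined_rate <= single_rate * (1 + tolerance):
        return "not sender-bound"  # adding senders did not help
    return "inconclusive"          # partial scaling, needs more testing

print(interpret(4500, 9000))  # sender-bound
print(interpret(4500, 4600))  # not sender-bound
print(interpret(4500, 6500))  # inconclusive
```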


Bastien_Chong · February 18, 2014, 8:45pm

Even with 2 logstash instances writing to the same cluster, it's not faster.


Tony_Su · February 18, 2014, 9:22pm

In my testing (5 virtual nodes on a single box), I've been observing pretty
much the same.
I've tried pointing logstash at different nodes in the cluster and got
approximately the same performance.

My main objective was to determine whether inserting the data into the
cluster through different nodes would cause the data to be distributed
differently. Surprisingly, I've found so far that there is little difference
at the cluster level (distribution of shards and data across nodes), although
the details (which shards end up on which nodes) would be different. In
other words, the overall degree of "uneven-ness" of the data was surprisingly
almost identical with each try, although where the uneven-ness showed up was
typically different.

And I also found that inserting data into more than one node at once
didn't seem to make a difference.

One possibility is that your two EC2 instances might be running on the same
hardware, which could explain our similar results?
I remember sitting in on a presentation years ago about this and how the
presenter "encouraged" EC2 to deploy nodes on different hardware. I don't
remember the details; I only remember that he determined he couldn't make it
a certainty but could tilt the odds so much in his favor (6:1?) that his VMs
would usually be on different hardware.

Tony
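
One likely explanation for the entry node making no difference: Elasticsearch chooses the target shard from a hash of the document's routing value (its id by default) modulo the number of primary shards, so whichever node receives the request forwards the document to the same shard. The sketch below only models that idea. `pick_shard` and `index_via` are hypothetical helpers, `zlib.crc32` is a stand-in rather than the hash Elasticsearch uses, and the ids and shard count are invented. It covers which shard a document lands in, not which node a shard is allocated to.

```python
import zlib

def pick_shard(doc_id, number_of_shards):
    """Map a document id to a shard number via hash-modulo routing.

    zlib.crc32 is only a stand-in; it is NOT the hash Elasticsearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

def index_via(entry_node, doc_ids, number_of_shards):
    """Simulate sending documents through `entry_node`.

    The entry node is deliberately unused: in this model it only forwards
    each document to the shard chosen by pick_shard.
    """
    return {doc_id: pick_shard(doc_id, number_of_shards) for doc_id in doc_ids}

doc_ids = [f"doc-{n}" for n in range(1000)]  # hypothetical ids
via_node1 = index_via("node1", doc_ids, number_of_shards=5)
via_node2 = index_via("node2", doc_ids, number_of_shards=5)

print(via_node1 == via_node2)  # True: same shard for every document
```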

