Shard allocation in a cluster: both shards on the same node

I have 2 ES nodes configured with 2 shards and 0 replicas. I'm testing how fast
logstash can push logs from a dummy "access_log" file to this cluster.

In my test, with m3.xlarge on EC2, I can push around 4500 logs/sec. But I
noticed that my 2 shards were on the same node. I still don't get how the ES
black magic works: why doesn't it split the shards across the nodes? Would
that allow me to push 9000/sec?

I can't tell whether logstash or ES is the bottleneck here. These are the two
processes as they are running (ES first, then logstash):

/usr/bin/java -Xms7g -Xmx7g -Xss256k -Djava.awt.headless=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-0.90.10.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/var/log/elasticsearch
-Des.default.path.data=/var/lib/elasticsearch
-Des.default.path.work=/tmp/elasticsearch
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.ElasticSearch

/usr/bin/java -Xmx1G -Xms1G -cp
/usr/local/bin/logstash/logstash.jar:/usr/local/bin/logstash/cloud-aws/*
logstash.runner agent --config /etc/logstash/mylogstash.conf --log
/var/log/logstash/logstash.log

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0abf0a40-94c4-4bae-adb4-5ecebc1ebb2a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Shards should distribute over the 2 nodes, assuming they are part of a
single cluster. Theoretically, yes, more shards distributed across multiple
nodes will increase indexing speed. But you can still be limited by other
resources such as network, CPU, or memory, so it's hard to say exactly what
your throughput will be.
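
A quick way to confirm where the shards actually live is to count started shard copies per node from the cluster state. The sketch below is only an illustration: it assumes the `/_cluster/state` response carries a `routing_table.indices.<index>.shards` map, and the node ID, index name and `localhost:9200` address are made up for the example.

```python
import json
from collections import Counter
from urllib.request import urlopen


def shards_per_node(cluster_state):
    """Count STARTED shard copies per node ID from a cluster-state dict."""
    counts = Counter()
    indices = cluster_state.get("routing_table", {}).get("indices", {})
    for index in indices.values():
        for copies in index.get("shards", {}).values():
            for copy in copies:
                if copy.get("state") == "STARTED":
                    counts[copy.get("node")] += 1
    return dict(counts)


def fetch_cluster_state(base_url="http://localhost:9200"):
    """Fetch the cluster state over HTTP (assumes a reachable node)."""
    with urlopen(base_url + "/_cluster/state") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical state: both shards of one index sit on the same node.
sample_state = {
    "routing_table": {
        "indices": {
            "logstash-2014.02.14": {
                "shards": {
                    "0": [{"state": "STARTED", "primary": True, "node": "node-a"}],
                    "1": [{"state": "STARTED", "primary": True, "node": "node-a"}],
                }
            }
        }
    }
}

print(shards_per_node(sample_state))
```

With 2 shards and 0 replicas, an even spread over two nodes would show one shard per node ID; a single node ID with a count of 2 is the situation described in the original post.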


I managed to split the shards by restarting ES on the master, then
retested. Throughput is the same.

4500/sec seems a bit low; each doc is just 8k. Network doesn't seem to be
the bottleneck. I checked the disk IO, and it's between 0 (probably when
it's buffering before flushing) and 50/70. Do you think I should get
Provisioned IOPS on my EC2 instance?
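
To put the numbers above in perspective, here is the raw payload bandwidth they imply. This is a back-of-the-envelope sketch that assumes "8k" means 8 KiB per document and ignores HTTP and JSON overhead, so the real traffic will be somewhat higher.

```python
DOCS_PER_SEC = 4500
DOC_SIZE_BYTES = 8 * 1024  # assumption: "8k" means 8 KiB per document

# Raw document payload only; protocol overhead is not included.
bytes_per_sec = DOCS_PER_SEC * DOC_SIZE_BYTES
mib_per_sec = bytes_per_sec / (1024 * 1024)
mbit_per_sec = bytes_per_sec * 8 / 1_000_000

print(f"{mib_per_sec:.1f} MiB/s, {mbit_per_sec:.0f} Mbit/s")
```

That is roughly 35 MiB/s, or about 295 Mbit/s, which is worth comparing against what the instance's network link and disk can actually sustain before ruling either of them out.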

(In reply to Binh Ly's message of Friday, February 14, 2014 12:53:11 PM UTC-5.)

I provisioned a 300 IOPS disk: no improvement at all.

Logstash is running on the same instance as the master node.

(In reply to Bastien Chong's message of Friday, February 14, 2014 2:10:47 PM UTC-5.)

It's hard to diagnose things offline, but is it possible for you to run
another logstash somewhere else (maybe on the second box), run both of
them in parallel, and see what your combined ES throughput is? They would
both be writing to the same single ES cluster.
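
One way to read the combined throughput on the ES side, independent of how many logstash instances are feeding it, is to sample the total document count twice and divide by the elapsed time. The sketch below is a minimal illustration: it assumes a node reachable at `localhost:9200` and that `GET /_count` returns a JSON body with a `count` field; the helper names are made up.

```python
import json
import time
from urllib.request import urlopen


def fetch_doc_count(base_url="http://localhost:9200"):
    """Return the total document count reported by the cluster."""
    with urlopen(base_url + "/_count") as resp:
        return json.loads(resp.read().decode("utf-8"))["count"]


def indexing_rate(get_count, interval_sec, sleep=time.sleep, clock=time.monotonic):
    """Sample the doc count twice, interval_sec apart; return docs/sec."""
    start_count, start = get_count(), clock()
    sleep(interval_sec)
    end_count, end = get_count(), clock()
    return (end_count - start_count) / (end - start)


if __name__ == "__main__":
    # Run this while both logstash instances are pushing data.
    print(f"{indexing_rate(fetch_doc_count, 30):.0f} docs/sec")
```

Newly indexed documents only show up in the count after a refresh, so a sampling interval of tens of seconds gives a steadier number than a very short one.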


Even with 2 logstash instances writing to the same cluster, it's not faster.

(In reply to Binh Ly's message of Friday, February 14, 2014 2:45:47 PM UTC-5.)

In my testing (5 virtual nodes on a single box), I've been observing pretty
much the same.
I've tried pointing logstash at different nodes in the cluster and got
approximately the same performance.

My main objective was to determine whether inserting the data into the
cluster through different nodes would cause it to be distributed
differently. Surprisingly, I've found so far that there is little difference
at the cluster level (distribution of shards and data across nodes), although
the details (which shards end up on which nodes) differ. In
other words, the overall degree of "uneven-ness" of the data was almost
identical on each try, although where the uneven-ness showed up was typically
different.
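
To compare the "uneven-ness" between tries with a single number, one option is the coefficient of variation of the per-node document counts. This is just one possible measure, not something ES reports itself, and the node names and counts below are invented for illustration.

```python
from statistics import mean, pstdev


def unevenness(docs_per_node):
    """Coefficient of variation of per-node doc counts (0.0 means even)."""
    values = list(docs_per_node.values())
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


# Hypothetical per-node document counts from two separate tries.
try_1 = {"node-1": 120000, "node-2": 80000, "node-3": 100000}
try_2 = {"node-1": 80000, "node-2": 100000, "node-3": 120000}

print(unevenness(try_1), unevenness(try_2))
```

The two invented tries put the heavy load on different nodes but produce the same score, which matches the pattern described above: the same overall unevenness, located differently.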

And I also found that inserting data into more than one node at once
didn't seem to make a difference.

One possibility is that your two EC2 instances might be running on the same
hardware, which could explain our similar results?
I remember sitting in on a presentation years ago about this and how the
presenter "encouraged" EC2 to deploy nodes on different hardware. I don't
remember the details; I only remember that he determined he couldn't make it a
certainty but could tilt the odds so much in his favor (6:1?) that his VMs
would usually be on different hardware.

Tony

(In reply to Bastien Chong's message of Tuesday, February 18, 2014 12:45:32 PM UTC-8.)