Does ES throughput/latency get bottlenecked by the slowest host in the cluster?


(T Vinod Gupta) #1

I have a cluster of 2 nodes on AWS: 1 m1.large and 1 m1.xlarge. I
allocated 4GB of heap on the former and 8GB on the latter for ES. I also
have an m1.large non-data, non-master node in addition, to act as a load
balancer. We are seeing high latency on search queries, while indexing is
happening at a very high rate in parallel.

I was wondering if the asymmetry in the topology is causing any slowness.
CPU usage is high on both nodes due to the ES process.

Searches are taking 5-10 seconds. Trying hard to figure out how to
speed up the queries.

thanks



(Jörg Prante) #2

Query response times on the order of 5-10 seconds cannot be explained by
different-sized nodes alone; there must be other reasons.

Right now, shard management in ES assumes all nodes are equal in JVM
capacity, so it is not a good idea to run production indexes spanning
machines of different capacity: the ratio of shard size to the JVM power
underneath it gets skewed.

At query time, ES uses the shard responses that return fastest, so if your
larger JVM holds a replica of every shard, I'm afraid you wouldn't get much
faster responses even with the index distributed over equal nodes.
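One way to work around unequal hardware is shard allocation filtering: tag the nodes and pin an index's shards to the stronger machines. This is a hedged sketch, not a verified config; the attribute name `tag`, the index name `myindex`, and the host are placeholders, and the setting names follow the shard allocation filtering docs of that era:

```shell
# On the m1.xlarge, tag the node in elasticsearch.yml:
#   node.tag: xlarge
#
# Then tell ES to allocate this index's shards only to nodes with that tag:
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index.routing.allocation.include.tag": "xlarge"
}'
```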

Best regards,

Jörg



(Clinton Gormley) #3

At query time, ES selects the responses from shards that return
fastest,

Is this true? I thought it just round-robined across all available
shards. So if you have one shard on a fast server and its replica on a
slow server, you would get alternating fast and slow responses.

clint



(T Vinod Gupta) #4

Thanks.
What are the "other reasons" that could be impacting my high search
latency? Is there something I should look at in bigdesk that can
identify the problem source? My m1.xlarge is using EBS.. could it be slow
disk access? If yes, how do I prove it?
Also, if I run the same query (query + filter) twice in a row, it's not
much faster the second time. I thought the filter cache would come into the
picture.
Besides, I'm using dfs_query_then_fetch.. could that be the reason?

thanks



(Clinton Gormley) #5

Hiya

what are the "other reasons" that could be impacting my high search
latency? is there something that i should look at in bigdesk that can
identify the problem source? my m1.xlarge is using ebs.. could it be
slow disk access. if yes, how do i prove it?

I suggest gisting a couple of example docs, plus the query that you are
using. It may be that you are querying inefficiently. Alternatively,
it may be that your disks are slow, or you have too much data for your
current cluster.

besides, im using dfs query then fetch.. could that be the reason?

Using dfs is only really necessary with small numbers of docs. With
large amounts of data, the difference in term distribution should even
itself out.
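Dropping DFS is just a matter of the `search_type` request parameter; a minimal sketch with a placeholder index, host, and query (the default search type is query_then_fetch):

```shell
# DFS variant: an extra round-trip first collects global term statistics
# before the query phase runs.
curl -s 'http://localhost:9200/myindex/_search?search_type=dfs_query_then_fetch' \
  -d '{"query": {"match_all": {}}}'

# Default (query_then_fetch): the same request without the extra phase.
curl -s 'http://localhost:9200/myindex/_search' \
  -d '{"query": {"match_all": {}}}'
```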

clint



(Shay Banon) #6

I suggest first using the same instance type for the actual data nodes. m1.xlarge is a good start, possibly with the new EBS optimization enabled.

Gisting the search requests you execute would help to show what you are doing. Try not to use DFS to begin with.



(Otis Gospodnetić) #7

Hi,

On Monday, September 17, 2012 3:53:41 PM UTC-4, T Vinod Gupta wrote:

thanks.
what are the "other reasons" that could be impacting my high search
latency? is there something that i should look at in bigdesk that can
identify the problem source? my m1.xlarge is using ebs.. could it be slow
disk access. if yes, how do i prove it?

Do you have something that monitors system metrics like swap, disk IO,
etc.? (if not, check http://sematext.com/spm/index.html )
You could also run vmstat and such for a quick and dirty check.
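For the quick-and-dirty route on Linux, the kernel's own counters are enough to see whether the box is swapping (a sketch; `vmstat` and `iostat` report the same numbers interactively):

```shell
#!/bin/sh
# Current swap usage: SwapTotal minus SwapFree (both in kB).
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# Cumulative pages swapped in/out since boot; if these counters keep
# climbing between two runs, the JVM heap is being paged out and
# query latency will suffer.
grep -E '^pswp(in|out) ' /proc/vmstat
```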

Otis
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html



(T Vinod Gupta) #8

Thanks for all the answers. I was able to make a big dent by doing 2
things:

  1. Not using DFS anymore.
  2. I also noticed that both instances were swapping. I freed up some heap
     space by shutting down other processes and flushed the swap file,
     which made a big difference.

Next step is to follow your advice and move to m1.xlarge.
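For reference, the two usual ways to keep ES from swapping again; the setting name is from the elasticsearch.yml docs of that era, so treat this as a sketch rather than a verified config:

```shell
# Option 1: disable swap entirely on a dedicated ES box.
sudo swapoff -a

# Option 2: lock the ES heap in RAM so the OS cannot page it out.
# In elasticsearch.yml:
#   bootstrap.mlockall: true
# This requires the ES process to be allowed to lock memory,
# e.g. run `ulimit -l unlimited` before starting it.
```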

thanks


