Problems with TCP connections

On Nov 15, 10:58 pm, Ocean Wu darkyo...@gmail.com wrote:

Hi,

I have two servers running an Elasticsearch cluster as our website's search
engine, and we use Elastica as our PHP client.

At the beginning, queries were sent directly to ES, but the servers were very
unstable: TCP connections climbed to about 500-600, ES couldn't handle them
quickly, and we constantly got timeout responses (we set the timeout to 5s).
So we added a 5-minute cache with memcached and the situation got better;
TCP connections now average around 10.
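
The wrapper is essentially this pattern (a minimal sketch, not our production
code: it uses raw curl instead of Elastica, and the index name, key scheme,
and error handling are all illustrative):

    <?php
    // Minimal sketch: try memcached first, fall back to ES over HTTP.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    function cachedSearch(Memcached $mc, $queryJson) {
        $key = 'es:' . md5($queryJson);        // cache key derived from the query body
        $hit = $mc->get($key);
        if ($hit !== false) {
            return $hit;                       // served from the 5-minute cache
        }
        // The index name "products" is illustrative.
        $ch = curl_init('http://localhost:9200/products/_search');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $queryJson);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);  // the same 5s timeout
        $result = curl_exec($ch);
        curl_close($ch);
        if ($result !== false) {
            $mc->set($key, $result, 300);      // 5-minute TTL
        }
        return $result;
    }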

I found that once the connection count goes over 100, things become very unstable.

Is this because the servers can't handle that many requests, or do I need to
optimize my queries? (Most queries take about 50 ms.)

Here is a gist of the node stats at one such point:
https://gist.github.com/1369446
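
(The stats were pulled with something along these lines:)

    curl -s 'http://localhost:9200/_cluster/nodes/stats?pretty=true'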

Did you update your limits.conf? The number of allowed connections might be
maxed out, which would explain why you are getting the timeouts.
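
For example, in /etc/security/limits.conf (the user name and values here are
just illustrative):

    # raise the open-file/socket limit for the user running ES
    elasticsearch  soft  nofile  32000
    elasticsearch  hard  nofile  32000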

On Nov 16, 12:08 am, Ocean Wu darkyo...@gmail.com wrote:

Yes, limits.conf is set to 32000, and net.nf_conntrack_max is set to 655360.

Thanks for the reply.
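
(The values actually in effect can be double-checked with something like:)

    # per-process open file limit for the current user
    ulimit -n
    # connections currently tracked vs. the configured maximum
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max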

On Wed, Nov 16, 2011 at 11:04 AM, electic electic@gmail.com wrote:

Then we are having the same issue:

https://groups.google.com/group/elasticsearch/browse_thread/thread/1861b5c253982c75

I notice that when my total index size exceeds RAM (16GB per machine), queries
start to take a bit longer. Once the connections pile up, the entire cluster
becomes massively unstable and crashes. My theory is that as the dataset grows,
what was once a fast query suddenly becomes slow (my queries fetch data from a
certain time range and sort), and I think that might be what is killing the
cluster.
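
The queries are roughly this shape (illustrative sketch; the field name and
values are made up):

    {
      "query": {
        "filtered": {
          "query":  { "match_all": {} },
          "filter": { "range": { "timestamp": { "from": 1321300000, "to": 1321400000 } } }
        }
      },
      "sort": [ { "timestamp": { "order": "desc" } } ]
    }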

-R

On Nov 16, 6:29 am, Shay Banon kim...@gmail.com wrote:

Can you try 0.18.3 and see if it helps? It might be related to the fix for the
connection problem while searching.

Sweet. Okay, I am running a test now. Will report back on any changes.

On Nov 16, 5:53 pm, Ocean Wu darkyo...@gmail.com wrote:

Seems better after I upgraded to 0.18.3.

On Nov 17, 10:48 am, electic elec...@gmail.com wrote:

So I still seem to be having the same issue. I have two machines with 16GB RAM
each and a 10,000 RPM drive. As the data size increased to about 20GB total
(20 million documents), queries seem to be taking longer, and connections back
up until the node no longer seems to accept HTTP requests.

The logs show nothing. There is no huge CPU or heap usage; it just goes dead.
Any ideas on what I can paste here in terms of logs to debug the issue?
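
(For instance, I could capture something like the following while it hangs,
if that would help:)

    # JVM thread dump of the ES process (<es-pid> = the ES java pid)
    jstack <es-pid> > threads.txt
    # node-level stats
    curl -s 'http://localhost:9200/_cluster/nodes/stats?pretty=true'
    # how many connections are piled up on the HTTP port
    netstat -an | grep 9200 | wc -l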

On Nov 17, 11:21 am, electic elec...@gmail.com wrote:

Here is my status after restarting the second node (the node that handles all
the query requests):

https://raw.github.com/gist/1374154/dc4df73f7fecb81491823ea7c51a6e00fa2c2ae3/gistfile1.txt

I think this might have something to do with the merge policy. It is happening
around the 20GB mark. Any ideas?
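
(The knob I mean is e.g. index.merge.policy.merge_factor in elasticsearch.yml;
the value below is just the default, shown for reference:)

    # merge factor of the (0.18-era) log merge policy
    index.merge.policy.merge_factor: 10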

Are you using keep-alive / persistent connections? If you open and close
connections constantly, maybe the OS is throttling them.
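
In PHP, for example, reusing a single curl handle keeps the underlying TCP
connection to ES open instead of reconnecting for every request (illustrative
sketch; the URL and index name are assumptions):

    <?php
    // One handle reused across requests => curl keeps the connection alive.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://localhost:9200/products/_search');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    foreach ($queries as $queryJson) {      // $queries: array of JSON bodies
        curl_setopt($ch, CURLOPT_POSTFIELDS, $queryJson);
        $responses[] = curl_exec($ch);      // same handle, same connection
    }
    curl_close($ch);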
