Abrupt performance drop above a certain index size

Hi,

After inserting a lot of documents in elasticsearch (when the index
size gets to about 4-5 times the RAM size), the performance on both
inserting and querying drops abruptly. By "abruptly", I mean there is
a thin line between returning a query in 1 second, and getting a
timeout from the browser.

It seems pretty obvious that the RAM is not enough for something (the
inverted index?), because the slower the storage, the more abrupt the
performance drop is.

I have two questions:

  • what is the specific reason that makes ES go suddenly from super-
    fast to super-slow?
  • is there something I can do to make the performance drop less abrupt?

The point is that I'm having trouble estimating what hardware I need,
when I need to account for spikes. It's kind of hard to find a balance
between budgeting 4 servers and praying it will work and budgeting 10
servers "just in case".

Thanks in advance!
Radu

P.S. Some information about my setup:

  • indexing log lines of about 1K each
  • default "local" gateway
  • default 1s refresh rate
  • using the Python "pyes" over HTTP for both insert and query
  • 1 replica per shard
  • first look in the logs didn't help

But I can reproduce the problem and provide any potentially useful
info if it's needed.

To help, I need more info, especially the number of indices / shards you have and the number of nodes you are running. Also, what do you mean by abrupt? I did not understand the "thin line" between 1 second and a timeout. What is the timeout?

Hi,

I have 7 indices - one for each day of logging. So I'm always
inserting into the latest index. Each index has 5 shards with one replica.

Last test was on 2 nodes, each with 8 cores and 20GB of RAM. The
performance test went like this:

  • 3 threads each insert 5000 log lines at a time with bulk
    indexing
  • after each million logs inserted, I check the query performance.
    If 5 queries in a row each take more than 4 seconds (spikes happen, so
    I wanted to rule that out), my "loading" script stops
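
The stop condition in the second bullet can be sketched roughly as below. This is only an illustration of the "5 slow queries in a row" rule, not the actual test script; the names `should_stop`, `threshold_s` and `run_length` are made up for the example.

```python
def should_stop(query_times, threshold_s=4.0, run_length=5):
    """Return True once `run_length` consecutive queries exceed `threshold_s` seconds."""
    slow_in_a_row = 0
    for elapsed in query_times:
        if elapsed > threshold_s:
            slow_in_a_row += 1
            if slow_in_a_row >= run_length:
                return True
        else:
            # A single fast query resets the streak, so isolated spikes are ignored.
            slow_in_a_row = 0
    return False


# Hypothetical query timings in seconds: one spike, then a sustained slowdown.
print(should_stop([1.0, 6.0, 1.2, 5.0, 5.5, 7.0, 9.0, 18.0]))
```

Counting consecutive slow queries rather than any single slow one is what rules out the occasional spike.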

Up to 70M documents, everything went brilliantly. So I left the test running
overnight, expecting to see where it stopped the next day.

Next day, the state was like this:

  • 192M logs were indexed
  • all inserts were timing out. This means it couldn't index 3x5000 log
    lines in 30 seconds
  • the test script was hung on a query, so I just stopped it

When I looked in the logs, I found that all queries had returned
in 1 second or so, except the last one, which took 18 seconds. So I queried for a
string and got no result in about 10 minutes, at which point I gave up.
Looking at the inserts, some of them started timing out once the index passed
100M documents, but only rarely (1 out of 10-20 times).

The last thing I did was to restart both ES instances. After that, a query
for a string would return in a few seconds, but a query that would do
"match_all" and sort by date (to give me the last 50 logs) would still
run for more than 10 minutes, at which point I stopped it.

Memory was of course full of cache, but the CPU and disk activities
were surprisingly low (let's say under 10% of capacity). It all looked
like a hang to me.

My conclusion was that up until 190M logs, inserts were slowly getting
worse, and queries were very fast. But after that it all became
unusable. At least the queries did; I haven't tried inserting fewer
documents.

I think that's about all the relevant information I can think of. If
you need more info, I can rerun the test next week and provide some
more. I'm running on 0.18.7, but I can retry with 0.19.0 if you think
it makes a difference.

Thanks!
Radu

  • Are you allocating enough memory to elasticsearch, or are you using the defaults?
  • You say you have 7 indices. Is that for the 192M logs? Or was the test you conducted against a single index?
  • Did you see any OOM (Out of Memory) failures in the logs? Can you use big desk to see how memory is used by the instances?

Obviously, there is a limit to how much data you can push into a 2 node cluster. It mainly comes from the overhead each shard has (the inverted index state it has), and things like field data cache (which is used when sorting).
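
To put rough numbers on those two sources of overhead, here is a back-of-the-envelope sketch using the figures given earlier in the thread (7 indices, 5 shards each, 1 replica, 2 nodes, 192M documents). The 8 bytes per document for a date sort field is an assumption (one long value per document, ignoring any per-shard overhead), not a measured number, and the function names are invented for the example.

```python
def shard_copies(indices, shards_per_index, replicas):
    """Total shard copies in the cluster: primaries plus their replicas."""
    return indices * shards_per_index * (1 + replicas)


def sort_field_bytes(docs, bytes_per_value=8):
    """Rough heap needed to hold one sort value per document (assumed 8-byte longs)."""
    return docs * bytes_per_value


total = shard_copies(indices=7, shards_per_index=5, replicas=1)
print(total, total / 2)  # 70 shard copies, 35.0 per node on a 2-node cluster

gib = sort_field_bytes(192_000_000) / 1024 ** 3
print(round(gib, 2))  # 1.43 (GiB) for one copy of the data, under the assumption above
```

Even under these optimistic assumptions, sorting all 192M documents by date needs more than a gigabyte of heap for the sort values alone, on top of the per-shard state of 35 shard copies per node.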

Hi,

On Mar 4, 12:00 am, Shay Banon kim...@gmail.com wrote:

> • Are you allocating enough memory to elasticsearch, or are you using the defaults?

I'm using the defaults. Would increasing ES_MAX_MEM help?

> • You say you have 7 indices, is that for the 192m logs? Or the test you conducted was against a single index?

Most logs were spread over two indices, because the test ran for two
days. The first one had ~60M, the second ~130M.

> • Did you see any OOM (Out of Memory) failures in the logs? Can you use big desk to see how memory is used by the instances?

I've checked the logs and I've checked big desk and I haven't seen any
relevant symptom. But I might have missed something. I will
double-check the logs tomorrow for OOM errors and report back.

> Obviously, there is a limit to how much data you can push into a 2 node cluster. It mainly comes from the overhead each shard has (the inverted index state it has), and things like field data cache (which is used when sorting).

Can you say some more about this? How can I calculate this limit?

Also, what would be the recommended settings for memory in this case?
Would ES slow down, but still work, if some of its memory gets
swapped?

> > • Did you see any OOM (Out of Memory) failures in the logs? Can you use big desk to see how memory is used by the instances?

> I've checked the logs and I've checked big desk and I haven't seen any
> relevant symptom. But it might have been me failing here. I will
> double-check the logs tomorrow for OOM errors and report back.

I've double-checked the logs and I actually got quite a lot of these:

java.lang.OutOfMemoryError: Java heap space

I can't believe I missed them in the first place :(
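
A scan along these lines would have caught them the first time. It is a minimal sketch that only counts lines containing the error class name; the sample log lines are made up, and in practice you would pass an open log file instead of a list.

```python
OOM_MARKER = "java.lang.OutOfMemoryError"


def count_oom_lines(lines):
    """Count the lines that mention an OutOfMemoryError."""
    return sum(1 for line in lines if OOM_MARKER in line)


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "[INFO ] node started",
    "java.lang.OutOfMemoryError: Java heap space",
    "[WARN ] something else",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
]
print(count_oom_lines(sample_log))  # 2
```

Because an open file is iterable line by line, `count_oom_lines(open(path))` works the same way for a real log, where `path` is wherever your ES logs live.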

So, I will try again with a bigger ES_MAX_MEM in
bin/elasticsearch.in.sh. Would it hurt if I set it to something huge like
"20g"?

Yeah, so I suggest configuring the memory settings for ES. In 0.19, there is a simple env var that you can set called ES_HEAP_SIZE, which sets both the min and max to the same value (see more here: http://www.elasticsearch.org/guide/reference/setup/installation.html). I recommend you set it to about half the memory you have on the machines (can you share some details on those?).
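
As a quick illustration of the "about half" rule of thumb, the sketch below turns a machine's RAM into a heap value. The helper name `suggested_heap_gb` and its `fraction` parameter are invented for the example, and the `g` suffix is the usual JVM size notation; treat the result as a starting point rather than a tuned value.

```python
def suggested_heap_gb(ram_gb, fraction=0.5):
    """Heap size in whole GB as a fraction of physical RAM (default: half)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1, exclusive")
    return int(ram_gb * fraction)


# For a machine with 20GB of RAM:
print("ES_HEAP_SIZE=%dg" % suggested_heap_gb(20))  # ES_HEAP_SIZE=10g
```

The fraction is a parameter because the right value depends on how much memory the OS file system cache ends up needing for your data.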

Thanks, Shay!

I will try with those settings. My nodes have 8 cores and 20GB of RAM.

I've started a test with min and max sizes of 15GB and 25GB (don't have the
results yet). I will try again with ES_HEAP_SIZE=10GB and see what the
results are.

Don't use 25GB on a machine that only has 20GB of RAM. The heap should always be smaller than the physical RAM, since swapping does not work well for GC-based systems.

Hi Shay,

Why do you recommend only using half? I learned from somewhere, I can't
remember where, that 3/4 was safe, even allowing for the OS processes and
file system caching. Have you run tests regarding this or heard of someone
who has?

Thanks,
Mark

On Tue, 2012-03-06 at 00:10 -0800, Mark Waddle wrote:

> Hi Shay,
>
> Why do you recommend only using half? I learned from somewhere, I
> can't remember where, that 3/4 was safe, even allowing for the OS
> processes and file system caching. Have you run tests regarding this
> or heard of someone who has?

The right number depends... I have 32GB machines, and am using a heap
size of 24GB. I arrived at that by starting out using 16GB and watching
how much memory the kernel used for file caching.

I typically had 8GB of free memory, so I gave that to ES.

These numbers should probably be revisited as your data grows

clint

The main point is not to allocate more to the JVM than you have RAM available, since you never want the JVM to swap. This is because of how the garbage collector works: when it runs, it "touches" many areas of memory, so it will cause swap thrashing. I usually recommend using half so there is also enough memory for things like the OS file system cache.
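
The same reasoning as a tiny check, purely as a sketch: it refuses a heap that does not fit in RAM and otherwise reports what is left for the OS and its file system cache. The function name is made up, and the result is an upper bound, since a JVM also uses some memory outside the heap.

```python
def memory_left_for_os_gb(ram_gb, heap_gb):
    """RAM (in GB) left outside the JVM heap for the OS and file system cache."""
    if heap_gb >= ram_gb:
        raise ValueError(
            "heap of %dGB does not fit in %dGB of RAM; the JVM would swap"
            % (heap_gb, ram_gb)
        )
    return ram_gb - heap_gb


print(memory_left_for_os_gb(20, 10))  # 10
```

With the 25GB maximum tried earlier on a 20GB machine, this raises an error instead of returning a number.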

Thank you Shay and Clint for your responses.

@Clint: How did you determine how much memory your file system cache is
consuming? Hopefully you are using Linux like I am ...

@Mark: you can use 'free' (I use free -m to show megabytes). Or, if
you use htop, you can see that the brown/orange memory is the one used
for cache.
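
If you want the same numbers from a script, one option on Linux is to read /proc/meminfo, which reports values in kB. The sketch below parses text in that format; the sample values are made up for illustration, and on a real machine you would pass `open("/proc/meminfo").read()` instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def free_and_cached_mb(text):
    """Return (free, cached) memory in MB."""
    info = parse_meminfo(text)
    return info["MemFree"] // 1024, info["Cached"] // 1024


# Made-up sample in /proc/meminfo format.
SAMPLE = """MemTotal:       20971520 kB
MemFree:         1048576 kB
Buffers:          204800 kB
Cached:          8388608 kB
"""
print(free_and_cached_mb(SAMPLE))  # (1024, 8192)
```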

@Shay: thanks for your advice. I've set both min and max sizes to half
the RAM and it seems to work OK. Insert speed gradually drops
from about 100M documents onward, and queries gradually get slower as well.
I went up to about 210M, where I stopped because the inserts were
too slow for my requirements. Also, a new query at that point would
take 20 seconds or so, but it becomes nearly instant once it "warms
up" (after a few queries).

Also "bootstrap.mlockall: true" (with ulimit -l unlimited before
starting ES) seems to help, by making ES start a lot faster when it
has a lot of documents to load.
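
To check that the ulimit actually took effect, something like the sketch below can be run from the same shell that will start ES, since child processes inherit the limit. It uses Python's standard `resource` module (Unix only); the function name is made up, and it only inspects the limit of the current process, not of an already running ES.

```python
import resource


def memlock_is_unlimited():
    """True if the current process may lock an unlimited amount of memory."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY


if memlock_is_unlimited():
    print("memlock limit is unlimited")
else:
    print("memlock limit is restricted; run 'ulimit -l unlimited' first")
```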

As Clinton says, fine-tuning should be done once I see what my
production data actually looks like.

Thanks again to everybody for their input to this thread. From my
point of view, this issue is solved.

On 07/03/12 06:16, Mark Waddle wrote:

> Thank you Shay and Clint for your responses.
>
> @Clint: How did you determine how much memory your file system cache
> is consuming? Hopefully you are using Linux like I am ...

Just by using top and looking at the 'free' memory and the 'cached' size

clint

Monitoring heap memory used by ES is recommended as well (use big desk for it).
