Memory not released upon shutdown

Question for the memory gurus. Potentially not an ElasticSearch/Lucene
issue, but hopefully there are some settings to tweak to help out.

Running ES 0.20, default NIO filesystem, on a CentOS VM with 48 gigs. Could
not start up ES with 24 gigs of heap and mlockall enabled. Disabling
mlockall, with the same heap size, made ES work again. free shows 25 gigs
in use with ES not running and top shows no process with any significant
memory utilization. I do not have stats for the VM after a clean reboot. Is
there something within Lucene that grabs memory and never releases it? I am
used to profiling memory usage for running Java processes, but am clueless
when it comes to processes that are not running!
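(For reference, when no running process accounts for the usage, the kernel's own accounting is the place to look; slab caches such as dentries and inodes belong to the kernel, not to any process. A read-only sketch, safe to run:)

```shell
# Memory held by the kernel rather than by any process shows up here,
# not in top/ps. SReclaimable is slab memory the kernel can give back
# under pressure.
grep -E '^(MemFree|Cached|Slab|SReclaimable)' /proc/meminfo
```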

Cheers,

Ivan

--

Hi Ivan,

are you on Linux? Have you checked ulimit -l? That is the maximum size in
KB that a process is allowed to mlockall(). Maybe you need to adjust it in
/etc/security/limits.conf (followed by a re-login).
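(A quick sketch of the check and the corresponding limits.conf entry; the user name "elasticsearch" is an assumption, adjust it to whatever user runs the node:)

```shell
# Show the current max locked-memory limit for this shell, in KB
# ("unlimited" is also a possible value).
ulimit -l

# To lift the limit, an entry like the following in
# /etc/security/limits.conf (assuming the node runs as user
# "elasticsearch") takes effect after a re-login:
#   elasticsearch  -  memlock  unlimited
```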

Cheers,

Jörg

On Tuesday, November 13, 2012 8:35:15 AM UTC+1, Ivan Brusic wrote:


--

What does slabtop return in the "Active / Total Size" line?

On Tuesday, November 13, 2012 6:11:33 AM UTC-5, Jörg Prante wrote:


--

Responses inline.
On Tue, Nov 13, 2012 at 3:11 AM, Jörg Prante joergprante@gmail.com wrote:

Hi Ivan,

are you on Linux, have you checked ulimit -l ? That is the size in KB
which is allowed to mlockall(). Maybe you need to adjust it in
/etc/security/limits.conf (followed by a re-login)

Yes, I should have mentioned that the ulimits are correctly set. Setting
the heap to 20 GB with mlockall works since there is enough memory
available, but the system has a lot more.

What does slabtop return in "Active / Total Size" line?

All measurements are on a VM that has been rebooted (with ES in rc.d), but
has not seen any queries. Heap set to 24 GB and using mlockall.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48264      25703      22561          0         70        188

Active / Total Objects (% used) : 963790 / 971343 (99.2%)
Active / Total Slabs (% used) : 15839 / 15848 (99.9%)
Active / Total Caches (% used) : 105 / 185 (56.8%)
Active / Total Size (% used) : 63254.20K / 64421.21K (98.2%)
Minimum / Average / Maximum Object : 0.02K / 0.07K / 4096.00K

$ /etc/init.d/elasticsearch stop
Stopping ElasticSearch...
Stopped ElasticSearch.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48264        950      47314          0         70        188

Active / Total Objects (% used) : 959518 / 969149 (99.0%)
Active / Total Slabs (% used) : 15576 / 15592 (99.9%)
Active / Total Caches (% used) : 102 / 185 (55.1%)
Active / Total Size (% used) : 61793.27K / 63453.22K (97.4%)
Minimum / Average / Maximum Object : 0.02K / 0.07K / 4096.00K

--

The numbers below are for a different node than the one I first reported
on. I assumed the problem would be generic, but it might be specific to a
single node (for now).

free shows the memory still in use even though ES has been stopped, and the
service does not come back up.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48265      46150       2114          0         96        188

$ /etc/init.d/elasticsearch stop
Stopping ElasticSearch...
Stopped ElasticSearch.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48265      25470      22794          0         96        188

Active / Total Objects (% used) : 965687 / 976712 (98.9%)
Active / Total Slabs (% used) : 15743 / 15746 (100.0%)
Active / Total Caches (% used) : 101 / 185 (54.6%)
Active / Total Size (% used) : 62337.81K / 63982.26K (97.4%)
Minimum / Average / Maximum Object : 0.02K / 0.07K / 4096.00K

$ /etc/init.d/elasticsearch start
Starting ElasticSearch...
Waiting for
ElasticSearch..................................................................
running: PID:4834

$ tail /opt/elasticsearch/logs/service.log
STATUS | wrapper | 2012/11/13 07:53:54 | JVM process is gone.
ERROR | wrapper | 2012/11/13 07:53:54 | JVM exited unexpectedly.
STATUS | wrapper | 2012/11/13 07:53:59 | Launching a JVM...
INFO | jvm 5 | 2012/11/13 07:54:01 | WrapperManager: Initializing...
STATUS | wrapper | 2012/11/13 07:54:19 | JVM received a signal SIGKILL (9).
STATUS | wrapper | 2012/11/13 07:54:19 | JVM process is gone.
ERROR | wrapper | 2012/11/13 07:54:19 | JVM exited unexpectedly.
FATAL | wrapper | 2012/11/13 07:54:19 | There were 5 failed launches in a
row, each lasting less than 300 seconds. Giving up.
FATAL | wrapper | 2012/11/13 07:54:19 | There may be a configuration
problem: please check the logs.
STATUS | wrapper | 2012/11/13 07:54:19 | <-- Wrapper Stopped

Probably a VM tuning issue.
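(Since the wrapper log shows the JVM dying with SIGKILL, one thing worth checking is whether the kernel OOM killer is responsible. This is a guess, not something the thread confirms; the check itself is read-only:)

```shell
# A SIGKILL with no Java stack trace often means the kernel OOM killer
# fired. Recent kernel messages will say so explicitly if it did:
dmesg | grep -i -E 'out of memory|killed process' \
  || echo "no OOM-killer entries found"
```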

Cheers,

Ivan

On Tue, Nov 13, 2012 at 7:46 AM, Ivan Brusic ivan@brusic.com wrote:


--

Hi Ivan,

depending on the underlying OS memory organization, the JVM initialization
tries to be smart and allocates the initial heap in several steps, up to
the size given in -Xms. mlockall(), on the other hand, is a single call via
JNA, and is not so smart. That is most likely why you observe mlockall()
failing before the -Xms heap allocation itself fails.

Since the standard JVM cannot handle large heaps without stalls of seconds
or even minutes, you should reconsider your requirements. Extra-large heaps
do not give extra-large performance; quite the contrary, they are bad for
performance. 24 GB is too much for the current standard JVM to handle. You
will get better and more predictable performance with heaps of 4-8 GB,
because the CMS garbage collector is targeted to perform well in that
range. See also http://openjdk.java.net/jeps/144 for an enhancement call to
create a better, scalable GC for large RAM sizes.

Maybe you are interested in activating the G1 garbage collector in the Java
7 Oracle JVM:
http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
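(A sketch of how the G1 flags might be passed to the node. How JVM options actually reach the process depends on how your 0.20 install is started, and the pause target shown is an arbitrary example, not a recommendation:)

```shell
# Hypothetical: enable G1 on a Java 7 Oracle JVM. Whether ES_JAVA_OPTS is
# honored depends on your startup script/wrapper configuration.
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```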

Cheers,

Jörg

--

Thanks Jörg.

I completely understand why the JVM refuses to start with mlockall; the
question is why there is not enough free memory to begin with.

The difference between the nodes after ES has stopped:

             total       used       free     shared    buffers     cached
Mem:         48264        950      47314          0         70        188
Mem:         48265      25470      22794          0         96        188

The latter node never releases the memory allocated to it. I will be
upgrading to JDK 7 shortly, since there are various new GC options I want
to try out. But I would like to start with a clean slate and would love to
resolve the memory issue.

Ivan

On Tue, Nov 13, 2012 at 10:00 AM, Jörg Prante joergprante@gmail.com wrote:


--


Hi, a few notes here:

  1. The main reason mlockall is there is to make sure the memory
    (ES_HEAP_SIZE) allocated to the elasticsearch Java process will not be
    swapped. You can achieve that by other means, like setting swappiness.
    The reason you don't want a Java process to swap is the way the garbage
    collector works: it has to touch different parts of the process memory,
    causing pages to be swapped in and out a lot.
  2. It's perfectly fine to run elasticsearch with 24 GB of memory, and
    even more. You won't observe large pauses; we work hard in
    elasticsearch to play nicely with the garbage collector and eliminate
    those pauses. Many users run elasticsearch with 30 GB of memory in
    production.
  3. The more memory the Java process has, the more can be used for things
    like the filter cache (it automatically uses 20% of the heap by
    default) and other memory constructs. Leaving memory to the OS is also
    important, so the OS file system cache can do its magic as well.
    Usually we recommend allocating around 50% of the OS memory to the Java
    process, but prefer not to allocate more than 30 GB (below that, the
    JVM can be smart and compress pointer sizes).

Regarding memory not being released, that's strange. Can you double-check
that there isn't a process still running? Once the process no longer
exists, it will not hold the memory mentioned.
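(As a concrete sketch of the swappiness alternative mentioned in point 1 above; the write commands need root, so they are shown as comments:)

```shell
# Current swap tendency: 0 means avoid swapping; 60 is a common default.
cat /proc/sys/vm/swappiness

# As root, to reduce it at runtime and persist it across reboots:
#   sysctl -w vm.swappiness=0
#   echo 'vm.swappiness = 0' >> /etc/sysctl.conf
```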

On Tuesday, November 13, 2012 7:18:44 PM UTC+1, Ivan Brusic wrote:


--


Necro'ing the thread to say we may be seeing a version of this. We have a
uniform cluster of eight machines that run two systems: a transfer-only
elasticsearch node (no data, no master and no http), with 1GB heap and
mlockall=true; and a Storm+Trident topology that reads and writes several
thousand records per second in batch requests using the Java Client API. On
all the machines, over the course of a couple weeks -- untouched, in steady
state -- the memory usage of the processes does not change, but the amount
of free ram reported on the machine does.

The machine claims (see free -m below) to be using 5.7 GB out of 6.9 GB of
RAM, not counting the OS buffers+caches. Yet the ps aux output shows the
amount of RAM taken by active processes is only about 2.5 GB -- there are
3+ GB missing. meminfo shows that there is about 2.5 GB of slab cache, and
it is almost entirely consumed (says slabtop) by dentries: 605k slabs for
2.5 GB of RAM across 12M objects.

I can't say for sure whether this is a Storm thing or an ES thing, but it's
pretty clear that something is presenting Linux with an infinitely
fascinating number of ephemeral directories to cache. Does that sound like
anything ES/Lucene could produce? Given that it takes a couple of weeks to
create the problem, we're unlikely to be able to do experiments. (We are
going to increase the vfs_cache_pressure value to 10000 and otherwise just
keep a close eye on things.)


In case anyone else hits this, here are some relevant things to google for
(proceed at your own risk):

  • Briefly exerting some memory pressure on one of these nodes (sort -S
    500M) made it reclaim some of the slab cache -- its population declined
    to what you see below. My understanding is that the system will reclaim
    data from the slab cache exactly as needed. (Basically: this is not an
    implementation bug in the system producing the large slab occupancy,
    it's a UX bug in that htop, free and our monitoring tool don't include
    it under bufs+caches.) It at least makes monitoring a pain.

  • vfs_cache_pressure: "Controls the tendency of the kernel to reclaim the
    memory which is used for caching of directory and inode objects. When
    vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes
    due to memory pressure and this can easily lead to out-of-memory
    conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel
    to prefer to reclaim dentries and inodes."

  • From this SO thread: "If the slab cache is responsible for a large
    portion of your "missing memory", check /proc/slabinfo to see where
    it's gone. If it's dentries or inodes, you can use sudo bash -c 'sync ;
    echo 2 > /proc/sys/vm/drop_caches' to get rid of them."
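(Putting the two suggestions above together; the read-only checks are safe to run anywhere, while the reclaim commands need root and are shown as comments:)

```shell
# How aggressively the kernel reclaims dentries/inodes (default 100).
cat /proc/sys/vm/vfs_cache_pressure

# How much slab memory is reclaimable right now.
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo

# As root, to reclaim dentries and inodes immediately and to prefer
# reclaiming them under pressure going forward:
#   sync; echo 2 > /proc/sys/vm/drop_caches
#   sysctl -w vm.vfs_cache_pressure=10000
```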

free -m

             total       used       free     shared    buffers     cached
Mem:          6948       5725       1222          0        268        462
-/+ buffers/cache:       4994       1953
Swap:            0          0          0

ps aux | sort -rnk6 | head -n 20 | cut -c 1-100

USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START    TIME COMMAND
61021     5170  9.9 13.5 5449368 960588 ?     Sl   Jun28 2890:23 java (elasticsearch)
storm    22628 41.2  9.1 4477532 653556 ?     Sl   Jul01 9775:58 java (trident state)
storm    22623  6.0  1.8 3212816 133268 ?     Sl   Jul01 1438:13 java (trident wu)
storm    22621  6.0  1.8 3212816 129300 ?     Sl   Jul01 1423:30 java (trident wu)
storm    22625  6.1  1.8 3212816 128320 ?     Sl   Jul01 1450:38 java (trident wu)
storm    22631  6.2  1.7 3212816 125740 ?     Sl   Jul01 1481:30 java (trident wu)
storm     5629  0.4  1.6 3576976 114916 ?     Sl   Jun28  140:35 java (storm supervisor)
storm    22814 23.5  0.4  116240  34584 ?     Sl   Jul01 5577:39 ruby (wu)
storm    22822 23.4  0.4  116204  34548 ?     Sl   Jul01 5552:50 ruby (wu)
storm    22806 23.4  0.4  116200  34544 ?     Sl   Jul01 5554:17 ruby (wu)
storm    22830 23.3  0.4  116180  34524 ?     Sl   Jul01 5534:38 ruby (wu)
flip      7928  0.0  0.1   25352   7900 pts/4 Ss   06:31    0:00 -bash
flip     10268  0.0  0.0   25352   6548 pts/4 S+   06:51    0:00 -bash
syslog     718  0.0  0.0  254488   5024 ?     Sl   Apr05   15:30 rsyslogd -c5
root      7725  0.0  0.0   73360   3576 ?     Ss   06:31    0:00 sshd: flip [priv]
flip      7927  0.0  0.0   73360   1676 ?     S    06:31    0:00 sshd: flip@pts/4
whoopsie   836  0.0  0.0  187588   1628 ?     Ssl  Apr05    0:00 whoopsie
root         1  0.0  0.0   24460   1476 ?     Ss   Apr05    0:57 /sbin/init
flip     10272  0.0  0.0   16884   1260 pts/4 R+   06:51    0:00 /bin/ps aux

slabtop

 Active / Total Objects (% used)    : 12069032 / 13126009 (91.9%)
 Active / Total Slabs (% used)      : 615122 / 615122 (100.0%)
 Active / Total Caches (% used)     : 68 / 106 (64.2%)
 Active / Total Size (% used)       : 2270155.02K / 2467052.45K (92.0%)
 Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K

    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
12720456 11688175  91%    0.19K 605736       21   2422944K dentry
  182091   163690  89%    0.10K   4669       39     18676K buffer_head
   22496    22405  99%    0.86K    608       37     19456K ext4_inode_cache
   21760    21760 100%    0.02K     85      256       340K ext4_io_page
   21504    21504 100%    0.01K     42      512       168K kmalloc-8
   17680    16830  95%    0.02K    104      170       416K numa_policy
   11475     9558  83%    0.05K    135       85       540K shared_policy_node
...

dentry-state

sudo cat /proc/sys/fs/dentry-state
11688070        11677721        45      0       0       0

see http://linux.about.com/library/cmd/blcmdl5_slabinfo.htm
sudo cat /proc/slabinfo | sort -rnk2 | head

dentry           11689648 12720456  192  21  1 : tunables 0 0 0 : slabdata 605736 605736 0
buffer_head        163690   182091  104  39  1 : tunables 0 0 0 : slabdata   4669   4669 0
ext4_inode_cache    22405    22496  880  37  8 : tunables 0 0 0 : slabdata    608    608 0
ext4_io_page        21760    21760   16 256  1 : tunables 0 0 0 : slabdata     85     85 0
kmalloc-8           21504    21504    8 512  1 : tunables 0 0 0 : slabdata     42     42 0
numa_policy         16830    17680   24 170  1 : tunables 0 0 0 : slabdata    104    104 0
sysfs_dir_cache     11396    11396  144  28  1 : tunables 0 0 0 : slabdata    407    407 0
kmalloc-64          11072    11072   64  64  1 : tunables 0 0 0 : slabdata    173    173 0
kmalloc-32           9344     9344   32 128  1 : tunables 0 0 0 : slabdata     73     73 0

sudo cat /proc/meminfo

MemTotal:        7114792 kB
MemFree:         1443160 kB
Buffers:          275232 kB
Cached:           446828 kB
SwapCached:            0 kB
Active:          2810096 kB
Inactive:         240064 kB
Active(anon):    2299088 kB
Inactive(anon):      720 kB
Active(file):     511008 kB
Inactive(file):   239344 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               260 kB
Writeback:             0 kB
AnonPages:       2299184 kB
Mapped:            27944 kB
Shmem:               772 kB
Slab:            2506124 kB
SReclaimable:    2479280 kB
SUnreclaim:        26844 kB
KernelStack:        3512 kB
PageTables:        12968 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7114792 kB
Committed_AS:    2626600 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       26116 kB
VmallocChunk:   34359710188 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     7348224 kB
DirectMap2M:           0 kB

On Wednesday, November 14, 2012 6:50:24 AM UTC-6, kimchy wrote:


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I saw something like this about 1.5 years ago with an old version of Ubuntu
(hence my question about slabtop). It went away after upgrading the kernel
to the latest version. Which kernel version are you running?

On Thursday, July 18, 2013 3:50:04 AM UTC-4, Philip (Flip) Kromer wrote:

Necro'ing the thread to say we may be seeing a version of this. We have a
uniform cluster of eight machines that run two systems: a transfer-only
elasticsearch node (no data, no master and no http), with 1GB heap and
mlockall=true; and a Storm+Trident topology that reads and writes several
thousand records per second in batch requests using the Java Client API. On
all the machines, over the course of a couple weeks -- untouched, in steady
state -- the memory usage of the processes does not change, but the amount
of free ram reported on the machine does.

The machine claims (see free -m) to be using 5.7 GB out of 6.9GB ram,
not counting the OS buffers+caches. Yet the ps aux output shows the
amount of ram taken by active processes is only about 2.5GB -- there are 3+
missing GB of data. Meminfo shows that there is about 2.5GB of slab cache,
and it is almost entirely consumed (says slabtop) by 'dentries': 605k slabs
for 2.5GB ram on 12 M objects.

I can't say for sure whether this is a Storm thing or an ES thing, but
It's pretty clear that something is presenting Linux with an infinitely
fascinating number of ephemeral directories to cache. Does that sound like
anything ES/Lucene could produce? Given that it takes a couple weeks to
create the problem, we're unlikely to be able to do experiments. (We are
going increase the vfs_cache_pressure vaule to 10000 and otherwise just
keep a close eye on things).


In case anyone else hits this, here are some relevant things to google for
(proceed at your own risk):

  • Briefly exerting some memory pressure on one of these nodes (sort -S 500M) made it reclaim some of the slab cache -- its population declined to
    what you see below. My understanding is that the system will reclaim data
    from the slab cache exactly as needed. (Basically: this is not an
    implementation bug in the system producing the large slab occupancy, it's a
    UX bug in that htop, free and our monitoring tool don't include it under
    bufs+caches.) It at least makes monitoring a pain.

  • vfs_cache_pressure: "Controls the
    tendency of the kernel to reclaim the memory which is used for
    caching of directory and inode objects. When vfs_cache_pressure=0, the
    kernel will
    never reclaim dentries and inodes due to memory pressure and this can
    easily
    lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
    causes the kernel to prefer to reclaim dentries and inodes."

  • From this SO thread,
    "If the slab cache is resposible for a large portion of your "missing
    memory", check /proc/slabinfo to see where it's gone. If it's dentries or
    inodes, you can use sudo bash -c 'sync ; echo 2 > /proc/sys/vm/drop_caches' to get rid of them"


free -m

             total       used       free     shared    buffers     

cached
Mem: 6948 5725 1222 0 268
462
-/+ buffers/cache: 4994 1953
Swap: 0 0 0

ps aux | sort -rnk6 | head -n 20 | cut -c 1-100

USER       PID %CPU %MEM     VSZ    RSS TTY    STAT START TIME   

COMMAND
61021 5170 9.9 13.5 5449368 960588 ? Sl Jun28 2890:23 java
(elasticsearch)
storm 22628 41.2 9.1 4477532 653556 ? Sl Jul01 9775:58 java
(trident state)
storm 22623 6.0 1.8 3212816 133268 ? Sl Jul01 1438:13 java
(trident wu)
storm 22621 6.0 1.8 3212816 129300 ? Sl Jul01 1423:30 java
(trident wu)
storm 22625 6.1 1.8 3212816 128320 ? Sl Jul01 1450:38 java
(trident wu)
storm 22631 6.2 1.7 3212816 125740 ? Sl Jul01 1481:30 java
(trident wu)
storm 5629 0.4 1.6 3576976 114916 ? Sl Jun28 140:35 java
(storm supervisor)
storm 22814 23.5 0.4 116240 34584 ? Sl Jul01 5577:39 ruby
(wu)
storm 22822 23.4 0.4 116204 34548 ? Sl Jul01 5552:50 ruby
(wu)
storm 22806 23.4 0.4 116200 34544 ? Sl Jul01 5554:17 ruby
(wu)
storm 22830 23.3 0.4 116180 34524 ? Sl Jul01 5534:38 ruby
(wu)
flip 7928 0.0 0.1 25352 7900 pts/4 Ss 06:31 0:00 -bash
flip 10268 0.0 0.0 25352 6548 pts/4 S+ 06:51 0:00 -bash
syslog 718 0.0 0.0 254488 5024 ? Sl Apr05 15:30
rsyslogd -c5
root 7725 0.0 0.0 73360 3576 ? Ss 06:31 0:00
sshd: flip [priv]
flip 7927 0.0 0.0 73360 1676 ? S 06:31 0:00
sshd: flip@pts/4
whoopsie 836 0.0 0.0 187588 1628 ? Ssl Apr05 0:00
whoopsie
root 1 0.0 0.0 24460 1476 ? Ss Apr05 0:57
/sbin/init
flip 10272 0.0 0.0 16884 1260 pts/4 R+ 06:51 0:00
/bin/ps aux

slabtop

 Active / Total Objects (% used)    : 12069032 / 13126009 (91.9%)
 Active / Total Slabs (% used)      : 615122 / 615122 (100.0%)
 Active / Total Caches (% used)     : 68 / 106 (64.2%)
 Active / Total Size (% used)       : 2270155.02K / 2467052.45K (92.0%)
 Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K

    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
12720456 11688175  91%    0.19K 605736       21   2422944K dentry
  182091   163690  89%    0.10K   4669       39     18676K buffer_head
   22496    22405  99%    0.86K    608       37     19456K 

ext4_inode_cache
21760 21760 100% 0.02K 85 256 340K ext4_io_page
21504 21504 100% 0.01K 42 512 168K kmalloc-8
17680 16830 95% 0.02K 104 170 416K numa_policy
11475 9558 83% 0.05K 135 85 540K
shared_policy_node
...

dentry-state

sudo cat /proc/sys/fs/dentry-state
11688070        11677721        45      0       0       0

see http://linux.about.com/library/cmd/blcmdl5_slabinfo.htm
sudo cat /proc/slabinfo | sort -rnk2 | head

dentry            11689648 12720456    192   21    1 : tunables 0 0 0 

: slabdata 605736 605736 0
buffer_head 163690 182091 104 39 1 : tunables 0 0 0
: slabdata 4669 4669 0
ext4_inode_cache 22405 22496 880 37 8 : tunables 0 0 0
: slabdata 608 608 0
ext4_io_page 21760 21760 16 256 1 : tunables 0 0 0
: slabdata 85 85 0
kmalloc-8 21504 21504 8 512 1 : tunables 0 0 0
: slabdata 42 42 0
numa_policy 16830 17680 24 170 1 : tunables 0 0 0
: slabdata 104 104 0
sysfs_dir_cache 11396 11396 144 28 1 : tunables 0 0 0
: slabdata 407 407 0
kmalloc-64 11072 11072 64 64 1 : tunables 0 0 0
: slabdata 173 173 0
kmalloc-32 9344 9344 32 128 1 : tunables 0 0 0
: slabdata 73 73 0

sudo cat /proc/meminfo

MemTotal:        7114792 kB
MemFree:         1443160 kB
Buffers:          275232 kB
Cached:           446828 kB
SwapCached:            0 kB
Active:          2810096 kB
Inactive:         240064 kB
Active(anon):    2299088 kB
Inactive(anon):      720 kB
Active(file):     511008 kB
Inactive(file):   239344 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               260 kB
Writeback:             0 kB
AnonPages:       2299184 kB
Mapped:            27944 kB
Shmem:               772 kB
Slab:            2506124 kB
SReclaimable:    2479280 kB
SUnreclaim:        26844 kB
KernelStack:        3512 kB
PageTables:        12968 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7114792 kB
Committed_AS:    2626600 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       26116 kB
VmallocChunk:   34359710188 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     7348224 kB
DirectMap2M:           0 kB
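
This meminfo output already accounts for the "missing" memory: Slab is
about 2.4 GB, and nearly all of it is SReclaimable (i.e. the kernel can
drop it under memory pressure). A quick way to pull those fields out --
a sketch, assuming the standard field names of 2.6+ kernels:

```shell
# Show how much of the kernel slab cache is reclaimable -- memory that
# free/top do not attribute to any process.
awk '/^(Slab|SReclaimable|SUnreclaim):/ {
       printf "%-14s %6.2f GB\n", $1, $2 / 1024 / 1024
     }' /proc/meminfo
```

On the numbers above this prints roughly 2.39 GB of Slab, of which
2.36 GB is reclaimable.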

On Wednesday, November 14, 2012 6:50:24 AM UTC-6, kimchy wrote:

Hi, a few notes here:

  1. The main reason mlockall is there is to make sure the memory
    (ES_HEAP_SIZE) allocated to the elasticsearch java process will not be
    swapped. You can achieve that by other means, like setting swappiness.
    The reason you don't want a java process to swap is the way the garbage
    collector works: it has to touch different parts of the process memory,
    causing a lot of pages to be swapped in and out.
  2. It's perfectly fine to run elasticsearch with 24gb of memory, and even
    more. You won't observe large pauses. We work hard in elasticsearch to
    make sure we play nicely with the garbage collector and eliminate those
    pauses. Many users run elasticsearch with 30gb of memory in production.
  3. The more memory you give the java process, the more can be used for
    things like the filter cache (it automatically uses 20% of the heap by
    default) and other memory constructs. Leaving memory to the OS is also
    important so the OS file system cache can do its magic. Usually we
    recommend allocating around 50% of OS memory to the java process, but
    prefer not to allocate more than 30gb (because below that threshold the
    JVM can be smart and use compressed pointers).
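
For reference, the swappiness route from point 1 can be set as below.
This is a sketch, not an ES default -- the values and the limits.conf
line are illustrative, and the memlock limit only matters if you keep
mlockall enabled:

```shell
# Discourage the kernel from swapping anonymous pages (0-100; lower
# means swap less). Persist it via /etc/sysctl.conf to survive reboots.
sudo sysctl -w vm.swappiness=1

# If you keep bootstrap.mlockall instead, make sure the ES user may
# lock enough memory -- e.g. in /etc/security/limits.conf:
#   elasticsearch  -  memlock  unlimited
ulimit -l   # verify the current limit after re-login
```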

Regarding memory not being released, that's strange. Can you double
check that there isn't a process still running? Once the process no
longer exists, it cannot hold on to the mentioned memory.

On Tuesday, November 13, 2012 7:18:44 PM UTC+1, Ivan Brusic wrote:

Thanks Jörg.

I completely understand why the JVM refuses to start with mlockall; the
question is why there is not enough free memory to begin with.

The difference between the two nodes after ES has stopped (free -m):

Mem: 48264 950 47314 0 70 188
Mem: 48265 25470 22794 0 96 188

The latter node never releases the memory allocated to it. I will be
upgrading to JDK7 shortly, since there are various new GC options I
want to try out. But I would like to start with a clean slate and would
love to resolve the memory issue first.

Ivan

On Tue, Nov 13, 2012 at 10:00 AM, Jörg Prante joerg...@gmail.com wrote:

Hi Ivan,

depending on the underlying OS memory organization, JVM initialization
tries to be smart and allocates the initial heap in several steps, up
to the size given in -Xms. mlockall(), on the other hand, is a single
call via JNA, with no such stepping. That is most likely why you
observe mlockall() failures before -Xms heap allocation fails.

Since the standard JVM cannot handle large heaps without stalls of
seconds or even minutes, you should reconsider your requirements. Extra
large heaps do not give extra large performance; quite the contrary,
they are bad for performance. 24 GB is too much for the current
standard JVM to handle. You will get better and more predictable
performance with heaps of 4-8 GB, because the CMS garbage collector is
targeted to perform well in that range. See also
http://openjdk.java.net/jeps/144 for an enhancement call to create a
better, scalable GC for large RAM sizes.

Maybe you are interested in activating the G1 garbage collector in Java
7 Oracle JVM
http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
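
If you try G1, the flags below are a sketch -- assuming a Java 7 Oracle
JVM and a start script that honors JAVA_OPTS; adjust for your own init
script:

```shell
# Enable the G1 collector; MaxGCPauseMillis is a pause-time target,
# not a guarantee. Illustrative flags, not a tested ES configuration.
export JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
bin/elasticsearch
```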

Cheers,

Jörg

--

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Am 18.07.13 09:50, schrieb Philip (Flip) Kromer:

Necro'ing the thread to say we may be seeing a version of this. We
have a uniform cluster of eight machines that run two systems: a
transfer-only elasticsearch node (no data, no master and no http),
with 1GB heap and mlockall=true; and a Storm+Trident topology that
reads and writes several thousand records per second in batch requests
using the Java Client API. On all the machines, over the course of a
couple weeks -- untouched, in steady state -- the memory usage of the
processes does not change, but the amount of free ram reported on the
machine does.

The machine claims (see free -m) to be using 5.7 GB out of 6.9 GB of
ram, not counting the OS buffers+caches. Yet the ps aux output shows
that the active processes only account for about 2.5 GB -- 3+ GB are
missing. Meminfo shows about 2.5 GB of slab cache, and it is almost
entirely consumed (says slabtop) by 'dentries': 605k slabs holding
2.5 GB of ram across 12M objects.

I can't say for sure whether this is a Storm thing or an ES thing, but
it's pretty clear that something is presenting Linux with an
infinitely fascinating number of ephemeral directories to cache. Does
that sound like anything ES/Lucene could produce?

Nothing known like this - the dentry cache is under the control of the kernel.

Given that it takes a couple weeks to create the problem, we're
unlikely to be able to do experiments. (We are going to increase the
vfs_cache_pressure value to 10000 and otherwise just keep a close
eye on things.)
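
For anyone following along, the knobs involved are below. Both need
root; drop_caches only discards clean cache entries, so it is safe but
will empty warm caches:

```shell
# How aggressively the kernel reclaims dentry/inode caches
# (default 100; higher values reclaim sooner).
cat /proc/sys/vm/vfs_cache_pressure
sudo sysctl -w vm.vfs_cache_pressure=10000

# Drop the dentry and inode caches right now (2 = dentries+inodes,
# 3 = also the page cache).
sudo sh -c 'echo 2 > /proc/sys/vm/drop_caches'
```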

Instead, you should upgrade the kernel, which brings two advantages:
better selinux handling and improved extfs and memory management, both
of which may have side effects on the dentry cache. Check if you have
custom IP monitoring / net traffic filtering kernel modules running, or
custom selinux settings. In general, there is not much to worry about
until the kernel begins to kill processes because of OOM.

Jörg
