Necro'ing the thread to say we may be seeing a version of this. We have a
uniform cluster of eight machines that run two systems: a transfer-only
elasticsearch node (no data, no master, no HTTP) with a 1 GB heap and
mlockall=true; and a Storm+Trident topology that reads and writes several
thousand records per second in batch requests using the Java Client API. On
all the machines, over the course of a couple of weeks -- untouched, in steady
state -- the memory usage of the processes does not change, but the amount
of free RAM reported on the machine does.
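For anyone reproducing the setup: the transfer-only node amounts to roughly
this in elasticsearch.yml (a sketch of the relevant settings from that era,
not our literal config):

node.data: false
node.master: false
http.enabled: false
bootstrap.mlockall: true

with ES_HEAP_SIZE=1g in the service environment.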
The machine claims (see free -m below) to be using 5.7 GB out of 6.9 GB of
RAM, not counting the OS buffers+caches. Yet the ps aux output shows that
active processes account for only about 2.5 GB -- there are 3+ GB
unaccounted for. Meminfo shows about 2.5 GB of slab cache, and slabtop says
it is almost entirely consumed by dentries: 605k slabs holding 12M objects
in 2.5 GB of RAM.
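For anyone wanting to redo the arithmetic, this is roughly how we totaled it
(standard procfs; the awk one-liner just sums the RSS column of ps):

ps aux | awk 'NR>1 { rss += $6 } END { printf "%.0f MB in process RSS\n", rss/1024 }'
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo   # kernel-side slab usage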
I can't say for sure whether this is a Storm thing or an ES thing, but it's
pretty clear that something is presenting Linux with an infinitely
fascinating number of ephemeral directories to cache. Does that sound like
anything ES/Lucene could produce? Given that it takes a couple of weeks to
create the problem, we're unlikely to be able to do experiments. (We are
going to increase the vfs_cache_pressure value to 10000 -- see below -- and
otherwise just keep a close eye on things.)
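The change itself is a one-liner (add vm.vfs_cache_pressure = 10000 to
/etc/sysctl.conf to make it persistent across reboots):

sudo sysctl -w vm.vfs_cache_pressure=10000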
In case anyone else hits this, here are some relevant things to google for
(proceed at your own risk):

- Briefly exerting some memory pressure on one of these nodes (sort -S 500M)
made it reclaim some of the slab cache -- its population declined to what
you see below; a sketch of the experiment follows this list. My
understanding is that the system will reclaim data from the slab cache
exactly as needed. (Basically: this is not an implementation bug in the
system producing the large slab occupancy, it's a UX bug, in that htop, free,
and our monitoring tool don't include it under bufs+caches.) It at least
makes monitoring a pain.
- vfs_cache_pressure: "Controls the tendency of the kernel to reclaim the
memory which is used for caching of directory and inode objects. When
vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due
to memory pressure and this can easily lead to out-of-memory conditions.
Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to
reclaim dentries and inodes."
- From this SO thread: "If the slab cache is responsible for a large portion
of your 'missing memory', check /proc/slabinfo to see where it's gone. If
it's dentries or inodes, you can use sudo bash -c 'sync ; echo 2 >
/proc/sys/vm/drop_caches' to get rid of them"
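To watch the reclaim happen, something along these lines works (a sketch:
the head|sort pipeline briefly touches ~500 MB of RAM, and drop_caches=2 is
the heavier hammer from the SO quote above):

grep -E '^(Slab|SReclaimable)' /proc/meminfo            # before
head -c 500M /dev/urandom | sort -S 500M > /dev/null    # brief memory pressure
# or: sudo bash -c 'sync ; echo 2 > /proc/sys/vm/drop_caches'
grep -E '^(Slab|SReclaimable)' /proc/meminfo            # after: SReclaimable drops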
free -m
             total       used       free     shared    buffers     cached
Mem:          6948       5725       1222          0        268        462
-/+ buffers/cache:       4994       1953
Swap:            0          0          0
ps aux | sort -rnk6 | head -n 20 | cut -c 1-100
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START    TIME COMMAND
61021     5170  9.9 13.5 5449368 960588 ?     Sl   Jun28 2890:23 java (elasticsearch)
storm    22628 41.2  9.1 4477532 653556 ?     Sl   Jul01 9775:58 java (trident state)
storm    22623  6.0  1.8 3212816 133268 ?     Sl   Jul01 1438:13 java (trident wu)
storm    22621  6.0  1.8 3212816 129300 ?     Sl   Jul01 1423:30 java (trident wu)
storm    22625  6.1  1.8 3212816 128320 ?     Sl   Jul01 1450:38 java (trident wu)
storm    22631  6.2  1.7 3212816 125740 ?     Sl   Jul01 1481:30 java (trident wu)
storm     5629  0.4  1.6 3576976 114916 ?     Sl   Jun28  140:35 java (storm supervisor)
storm    22814 23.5  0.4  116240  34584 ?     Sl   Jul01 5577:39 ruby (wu)
storm    22822 23.4  0.4  116204  34548 ?     Sl   Jul01 5552:50 ruby (wu)
storm    22806 23.4  0.4  116200  34544 ?     Sl   Jul01 5554:17 ruby (wu)
storm    22830 23.3  0.4  116180  34524 ?     Sl   Jul01 5534:38 ruby (wu)
flip      7928  0.0  0.1   25352   7900 pts/4 Ss   06:31    0:00 -bash
flip     10268  0.0  0.0   25352   6548 pts/4 S+   06:51    0:00 -bash
syslog     718  0.0  0.0  254488   5024 ?     Sl   Apr05   15:30 rsyslogd -c5
root      7725  0.0  0.0   73360   3576 ?     Ss   06:31    0:00 sshd: flip [priv]
flip      7927  0.0  0.0   73360   1676 ?     S    06:31    0:00 sshd: flip@pts/4
whoopsie   836  0.0  0.0  187588   1628 ?     Ssl  Apr05    0:00 whoopsie
root         1  0.0  0.0   24460   1476 ?     Ss   Apr05    0:57 /sbin/init
flip     10272  0.0  0.0   16884   1260 pts/4 R+   06:51    0:00 /bin/ps aux
slabtop
Active / Total Objects (% used)    : 12069032 / 13126009 (91.9%)
Active / Total Slabs (% used)      : 615122 / 615122 (100.0%)
Active / Total Caches (% used)     : 68 / 106 (64.2%)
Active / Total Size (% used)       : 2270155.02K / 2467052.45K (92.0%)
Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K

    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
12720456 11688175  91%    0.19K 605736       21   2422944K dentry
  182091   163690  89%    0.10K   4669       39     18676K buffer_head
   22496    22405  99%    0.86K    608       37     19456K ext4_inode_cache
   21760    21760 100%    0.02K     85      256       340K ext4_io_page
   21504    21504 100%    0.01K     42      512       168K kmalloc-8
   17680    16830  95%    0.02K    104      170       416K numa_policy
   11475     9558  83%    0.05K    135       85       540K shared_policy_node
...
dentry-state (fields: nr_dentry, nr_unused, age_limit, want_pages, plus two
dummy entries)
sudo cat /proc/sys/fs/dentry-state
11688070 11677721 45 0 0 0
sudo cat /proc/slabinfo | sort -rnk2 | head
dentry           11689648 12720456 192  21 1 : tunables 0 0 0 : slabdata 605736 605736 0
buffer_head        163690   182091 104  39 1 : tunables 0 0 0 : slabdata   4669   4669 0
ext4_inode_cache    22405    22496 880  37 8 : tunables 0 0 0 : slabdata    608    608 0
ext4_io_page        21760    21760  16 256 1 : tunables 0 0 0 : slabdata     85     85 0
kmalloc-8           21504    21504   8 512 1 : tunables 0 0 0 : slabdata     42     42 0
numa_policy         16830    17680  24 170 1 : tunables 0 0 0 : slabdata    104    104 0
sysfs_dir_cache     11396    11396 144  28 1 : tunables 0 0 0 : slabdata    407    407 0
kmalloc-64          11072    11072  64  64 1 : tunables 0 0 0 : slabdata    173    173 0
kmalloc-32           9344     9344  32 128 1 : tunables 0 0 0 : slabdata     73     73 0
sudo cat /proc/meminfo
MemTotal: 7114792 kB
MemFree: 1443160 kB
Buffers: 275232 kB
Cached: 446828 kB
SwapCached: 0 kB
Active: 2810096 kB
Inactive: 240064 kB
Active(anon): 2299088 kB
Inactive(anon): 720 kB
Active(file): 511008 kB
Inactive(file): 239344 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 260 kB
Writeback: 0 kB
AnonPages: 2299184 kB
Mapped: 27944 kB
Shmem: 772 kB
Slab: 2506124 kB
SReclaimable: 2479280 kB
SUnreclaim: 26844 kB
KernelStack: 3512 kB
PageTables: 12968 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7114792 kB
Committed_AS: 2626600 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 26116 kB
VmallocChunk: 34359710188 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7348224 kB
DirectMap2M: 0 kB
On Wednesday, November 14, 2012 6:50:24 AM UTC-6, kimchy wrote:
Hi, a few notes here:
- The main reason mlockall is there is to make sure the memory
(ES_HEAP_SIZE) allocated to the elasticsearch java process will not be
swapped. You can achieve that by other means, like setting swappiness
(sketched after these notes). The reason you don't want a java process to
swap is the way the garbage collector works: it has to touch different parts
of the process memory, causing it to swap a lot of pages in and out.
- It's perfectly fine to run elasticsearch with 24gb of memory, and even
more. You won't observe large pauses; we work hard in elasticsearch to play
nicely with the garbage collector and eliminate those pauses. Many users run
elasticsearch with 30gb of memory in production.
- The more memory the java process has, the more can be used for things like
the filter cache (it automatically uses 20% of the heap by default) and
other memory constructs. Leaving memory to the OS is also important, so the
OS file system cache can do its magic as well. Usually we recommend
allocating around 50% of OS memory to the java process, but prefer not to
allocate more than 30gb (because below that the JVM can be smart and use
compressed pointers).
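As a concrete sketch of the swappiness alternative from the first note (the
values are illustrative, e.g. for a 32gb box; not a recommendation from this
thread):

sudo sysctl -w vm.swappiness=0   # keep the heap from being swapped without mlockall
export ES_HEAP_SIZE=16g          # ~50% of RAM, under the ~30gb compressed-pointer limit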
Regarding memory not being released, that's strange. Can you double check
that there isn't a process still running? Once the process no longer exists,
it cannot be holding the memory you mention.
On Tuesday, November 13, 2012 7:18:44 PM UTC+1, Ivan Brusic wrote:
Thanks Jörg.
I completely understand why the JVM refuses to start with mlockall; the
question is why there is not enough free memory to begin with.
The difference between the nodes after ES has stopped (free -m, columns:
total / used / free / shared / buffers / cached):
Mem:  48264    950  47314  0  70  188
Mem:  48265  25470  22794  0  96  188
The latter node never releases the memory allocated to it. I will be
upgrading to JDK7 shortly, since there are various new GC options I want to
try out, but I would like to start from a clean slate and would love to
resolve the memory issue first.
Ivan
On Tue, Nov 13, 2012 at 10:00 AM, Jörg Prante joerg...@gmail.com wrote:
Hi Ivan,
depending on the underlying OS memory organization, JVM initialization tries
to be smart and allocates the initial heap in several steps, up to the size
given in -Xms. mlockall(), on the other hand, is a single call via JNA, and
is not so smart. This is certainly the reason why you observe mlockall()
failures before the -Xms heap allocation fails.
Since the standard JVM cannot handle large heaps without stalls of seconds
or even minutes, you should reconsider your requirements. Extra-large heaps
do not give extra-large performance; quite the contrary, they are not good
for performance. 24 GB is too much for the current standard JVM to handle.
You will get better and more predictable performance with heaps of 4-8 GB,
because the CMS garbage collector is targeted to perform well in that range.
See also JEP 144: Reduce GC Latency for Large Heaps, an enhancement call to
create a better, scalable GC for larger RAM.
Maybe you are also interested in activating the G1 garbage collector in the
Java 7 Oracle JVM; see "Java HotSpot Garbage Collection".
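For reference, enabling G1 is just a JVM startup flag; with elasticsearch's
startup scripts, something like (illustrative):

export ES_JAVA_OPTS="-XX:+UseG1GC"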
Cheers,
Jörg