Necro'ing the thread to say we may be seeing a version of this. We have a
uniform cluster of eight machines that run two systems: a transfer-only
elasticsearch node (no data, no master and no http), with 1GB heap and
mlockall=true; and a Storm+Trident topology that reads and writes several
thousand records per second in batch requests using the Java Client API. On
all the machines, over the course of a couple weeks -- untouched, in steady
state -- the memory usage of the processes does not change, but the amount
of free ram reported on the machine does.
The machine claims (see free -m below) to be using 5.7 GB out of 6.9 GB of RAM, not
counting the OS buffers+caches. Yet the ps aux output shows that the RAM taken by
active processes is only about 2.5 GB -- there are 3+ GB unaccounted for. Meminfo
shows about 2.5 GB of slab cache, and it is almost entirely consumed (says slabtop)
by dentries: 605k slabs holding 12M objects for 2.5 GB of RAM.
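For anyone who wants to reproduce the arithmetic, the comparison boils down to
something like this (a rough sketch assuming procps-style ps and free output;
RSS double-counts shared pages, so treat the totals as approximate):

# Sum resident set sizes (RSS, kB) of all user-space processes
ps aux | awk 'NR > 1 { rss += $6 } END { printf "total RSS: %.1f MB\n", rss / 1024 }'

# "Used" memory excluding OS buffers/caches, plus the kernel slab totals
free -m | awk '/buffers\/cache/ { print "used (-/+ buffers/cache): " $3 " MB" }'
grep -E '^(Slab|SReclaimable):' /proc/meminfo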
I can't say for sure whether this is a Storm thing or an ES thing, but it's
pretty clear that something is presenting Linux with an infinitely
fascinating number of ephemeral directories to cache. Does that sound like
anything ES/Lucene could produce? Given that it takes a couple of weeks to
create the problem, we're unlikely to be able to do experiments. (We are
going to increase the vfs_cache_pressure value to 10000 and otherwise just
keep a close eye on things.)
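For reference, this is roughly what we're doing; the 10000 value and the log
path are our own choices, not a recommendation:

# Make the kernel much more willing to reclaim dentry/inode cache (default is 100)
sudo sysctl -w vm.vfs_cache_pressure=10000
echo 'vm.vfs_cache_pressure = 10000' | sudo tee -a /etc/sysctl.conf   # persist across reboots

# Cheap way to watch dentry growth over time (e.g. from cron)
echo "$(date) $(sudo grep '^dentry ' /proc/slabinfo)" >> /var/log/dentry-growth.log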
In case anyone else hits this, here are some relevant things to google for
(proceed at your own risk):

- Briefly exerting some memory pressure on one of these nodes (sort -S 500M)
made it reclaim some of the slab cache -- its population declined to
what you see below. My understanding is that the system will reclaim data
from the slab cache exactly as needed. (Basically: this is not an
implementation bug in the system producing the large slab occupancy, it's a
UX bug in that htop, free and our monitoring tool don't include it under
bufs+caches.) It at least makes monitoring a pain.
 
- vfs_cache_pressure: "Controls the tendency of the kernel to reclaim the
memory which is used for caching of directory and inode objects. When
vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due
to memory pressure and this can easily lead to out-of-memory conditions.
Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to
reclaim dentries and inodes."
- From this SO thread: "If the slab cache is responsible for a large portion
of your 'missing memory', check /proc/slabinfo to see where it's gone. If
it's dentries or inodes, you can use sudo bash -c 'sync ; echo 2 > /proc/sys/vm/drop_caches'
to get rid of them." (A before/after check along those lines is sketched just below.)
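Roughly, for a before/after check -- this only discards clean, reclaimable
cache, but do it on a production box at your own risk:

grep -E '^(MemFree|Slab|SReclaimable):' /proc/meminfo    # before

sync
echo 2 | sudo tee /proc/sys/vm/drop_caches               # 2 = free dentries and inodes only

grep -E '^(MemFree|Slab|SReclaimable):' /proc/meminfo    # after: SReclaimable should collapse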
free -m
             total       used       free     shared    buffers     cached
Mem:          6948       5725       1222          0        268        462
-/+ buffers/cache:       4994       1953
Swap:            0          0          0
ps aux | sort -rnk6 | head -n 20 | cut -c 1-100
USER       PID %CPU %MEM     VSZ    RSS TTY    STAT START TIME    COMMAND
61021     5170  9.9 13.5 5449368 960588 ?      Sl   Jun28 2890:23 java (elasticsearch)
storm    22628 41.2  9.1 4477532 653556 ?      Sl   Jul01 9775:58 java (trident state)
storm    22623  6.0  1.8 3212816 133268 ?      Sl   Jul01 1438:13 java (trident wu)
storm    22621  6.0  1.8 3212816 129300 ?      Sl   Jul01 1423:30 java (trident wu)
storm    22625  6.1  1.8 3212816 128320 ?      Sl   Jul01 1450:38 java (trident wu)
storm    22631  6.2  1.7 3212816 125740 ?      Sl   Jul01 1481:30 java (trident wu)
storm     5629  0.4  1.6 3576976 114916 ?      Sl   Jun28  140:35 java (storm supervisor)
storm    22814 23.5  0.4  116240  34584 ?      Sl   Jul01 5577:39 ruby (wu)
storm    22822 23.4  0.4  116204  34548 ?      Sl   Jul01 5552:50 ruby (wu)
storm    22806 23.4  0.4  116200  34544 ?      Sl   Jul01 5554:17 ruby (wu)
storm    22830 23.3  0.4  116180  34524 ?      Sl   Jul01 5534:38 ruby (wu)
flip      7928  0.0  0.1   25352   7900 pts/4  Ss   06:31    0:00 -bash
flip     10268  0.0  0.0   25352   6548 pts/4  S+   06:51    0:00 -bash
syslog     718  0.0  0.0  254488   5024 ?      Sl   Apr05   15:30 rsyslogd -c5
root      7725  0.0  0.0   73360   3576 ?      Ss   06:31    0:00 sshd: flip [priv]
flip      7927  0.0  0.0   73360   1676 ?      S    06:31    0:00 sshd: flip@pts/4
whoopsie   836  0.0  0.0  187588   1628 ?      Ssl  Apr05    0:00 whoopsie
root         1  0.0  0.0   24460   1476 ?      Ss   Apr05    0:57 /sbin/init
flip     10272  0.0  0.0   16884   1260 pts/4  R+   06:51    0:00 /bin/ps aux
slabtop
 Active / Total Objects (% used)    : 12069032 / 13126009 (91.9%)
 Active / Total Slabs (% used)      : 615122 / 615122 (100.0%)
 Active / Total Caches (% used)     : 68 / 106 (64.2%)
 Active / Total Size (% used)       : 2270155.02K / 2467052.45K (92.0%)
 Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K
    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
12720456 11688175  91%    0.19K 605736       21   2422944K dentry
  182091   163690  89%    0.10K   4669       39     18676K buffer_head
   22496    22405  99%    0.86K    608       37     19456K ext4_inode_cache
   21760    21760 100%    0.02K     85      256       340K ext4_io_page
   21504    21504 100%    0.01K     42      512       168K kmalloc-8
   17680    16830  95%    0.02K    104      170       416K numa_policy
   11475     9558  83%    0.05K    135       85       540K shared_policy_node
...
dentry-state
sudo cat /proc/sys/fs/dentry-state
11688070        11677721        45      0       0       0
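If I'm reading the dentry-state fields right (on kernels of this vintage they
are nr_dentry, nr_unused, age_limit, want_pages, plus two unused values),
nearly every cached dentry is sitting unreferenced on the LRU, i.e. it is
reclaimable cache rather than pinned memory:

11688070   nr_dentry  -- dentries currently allocated
11677721   nr_unused  -- dentries not in active use (reclaim candidates)
      45   age_limit  -- age in seconds before unused dentries may be reclaimed
       0   want_pages -- pages the kernel is asking the dcache to give back
       0 0    (unused)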
sudo cat /proc/slabinfo | sort -rnk2 | head
dentry            11689648 12720456    192   21    1 : tunables 0 0 0 : slabdata 605736 605736 0
buffer_head         163690   182091    104   39    1 : tunables 0 0 0 : slabdata   4669   4669 0
ext4_inode_cache     22405    22496    880   37    8 : tunables 0 0 0 : slabdata    608    608 0
ext4_io_page         21760    21760     16  256    1 : tunables 0 0 0 : slabdata     85     85 0
kmalloc-8            21504    21504      8  512    1 : tunables 0 0 0 : slabdata     42     42 0
numa_policy          16830    17680     24  170    1 : tunables 0 0 0 : slabdata    104    104 0
sysfs_dir_cache      11396    11396    144   28    1 : tunables 0 0 0 : slabdata    407    407 0
kmalloc-64           11072    11072     64   64    1 : tunables 0 0 0 : slabdata    173    173 0
kmalloc-32            9344     9344     32  128    1 : tunables 0 0 0 : slabdata     73     73 0
sudo cat /proc/meminfo
MemTotal:        7114792 kB
MemFree:         1443160 kB
Buffers:          275232 kB
Cached:           446828 kB
SwapCached:            0 kB
Active:          2810096 kB
Inactive:         240064 kB
Active(anon):    2299088 kB
Inactive(anon):      720 kB
Active(file):     511008 kB
Inactive(file):   239344 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               260 kB
Writeback:             0 kB
AnonPages:       2299184 kB
Mapped:            27944 kB
Shmem:               772 kB
Slab:            2506124 kB
SReclaimable:    2479280 kB
SUnreclaim:        26844 kB
KernelStack:        3512 kB
PageTables:        12968 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7114792 kB
Committed_AS:    2626600 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       26116 kB
VmallocChunk:   34359710188 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     7348224 kB
DirectMap2M:           0 kB
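To line the meminfo numbers up (a quick awk sketch over the values above):

awk '/^(MemTotal|MemFree|Buffers|Cached|AnonPages|Slab):/ { v[$1] = $2 }
     END {
       used = v["MemTotal:"] - v["MemFree:"] - v["Buffers:"] - v["Cached:"]
       printf "used excl. buffers/cache: %5.0f MB\n", used / 1024
       printf "anonymous process pages:  %5.0f MB\n", v["AnonPages:"] / 1024
       printf "slab (mostly dentries):   %5.0f MB\n", v["Slab:"] / 1024
     }' /proc/meminfo

# With the values above: ~4.8 GB "used", of which ~2.2 GB is process memory
# and ~2.4 GB is slab -- the "missing" gigabytes.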
On Wednesday, November 14, 2012 6:50:24 AM UTC-6, kimchy wrote:
Hi, a few notes here:
- The main reason mlockall is there is to make sure the memory
(ES_HEAP_SIZE) allocated to the elasticsearch java process will not be swapped.
You can achieve that by other means, like setting swappiness. The reason
you don't want a java process to swap is the way the garbage collector
works: it has to touch different parts of the process memory, causing it to
swap a lot of pages in and out. (A config sketch follows this list.)
- It's perfectly fine to run elasticsearch with 24gb of memory, and even
more. You won't observe large pauses. We work hard in elasticsearch to make
sure we play nicely with the garbage collector and eliminate those pauses.
Many users run elasticsearch with 30gb of memory in production.
- The more memory you have for the java process, the more can be used for
things like the filter cache (it automatically uses 20% of the heap by
default) and other memory constructs. Leaving memory to the OS is also
important so the OS file system cache can do its magic as well. Usually, we
recommend around 50% of OS memory to be allocated to the java process, but
prefer not to allocate more than 30gb (because then the JVM can be smart and
use compressed pointers).
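(A sketch of those settings as they looked in the 0.20/0.90-era packaging;
file locations and variable names may differ in your install:)

# /etc/default/elasticsearch (or wherever your service script sources its environment):
# give ES roughly half the box, but no more than ~30 GB so the JVM keeps compressed pointers
ES_HEAP_SIZE=4g

# elasticsearch.yml: lock the heap into RAM
bootstrap.mlockall: true

# and/or make the OS reluctant to swap at all
sudo sysctl -w vm.swappiness=1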
Regarding memory not being released, that's strange. Can you double check
that there isn't a process still running? Once the process no longer
exists, it will not hold on to the mentioned memory.
On Tuesday, November 13, 2012 7:18:44 PM UTC+1, Ivan Brusic wrote:
Thanks Jörg.
I completely understand why the JVM refuses to start with mlockall; the
question is why there isn't enough free memory to begin with.
The difference between the nodes after ES has stopped:
Mem:         48264        950      47314          0         70        188
Mem:         48265      25470      22794          0         96        188
The latter node never releases the memory allocated to it. I will be
upgrading to JDK7 shortly since there are various new GC options I want to
try out, but I would like to try things out with a clean slate and would
love to resolve the memory issue first.
Ivan
On Tue, Nov 13, 2012 at 10:00 AM, Jörg Prante joerg...@gmail.com wrote:
Hi Ivan,
depending on the underlying OS memory organization, the JVM initialization
tries to be smart and allocates the initial heap in several steps,
re-allocating up to the size given in Xms. On the other hand, mlockall() is
a single call via JNA, and this is not so smart. This is certainly why you
observe mlockall() failures before the Xms heap allocation fails.
Since the standard JVM cannot handle large heaps without stalls of
seconds or even minutes, you should reconsider your requirements. Extra
large heaps do not give extra large performance; quite the contrary, they
are not good for performance. 24 GB is too much for the current standard JVM
to handle. You will get better and more predictable performance with heaps
of 4-8GB, because the CMS garbage collector is targeted to perform well in
that range. See also JEP 144: Reduce GC Latency for Large Heaps, an
enhancement proposal for a better, scalable GC for larger heaps.
Maybe you are interested in activating the G1 garbage collector in the Java
7 Oracle JVM (see the Java HotSpot Garbage Collection documentation).
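For the record, on a 7u4-or-later Oracle JVM G1 is enabled with a single
flag; with the elasticsearch scripts of that era the usual way to pass extra
flags is the ES_JAVA_OPTS environment variable (check your own packaging):

# Enable the G1 collector for the elasticsearch JVM
export ES_JAVA_OPTS="-XX:+UseG1GC"

# Optional: log GC activity so pause behaviour can be compared against CMS
export ES_JAVA_OPTS="$ES_JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"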
Cheers,
Jörg