My caches are relatively small, so I'm wondering what is chewing up
all my old gen space. Here is some more information.
From BigDesk:
Number of documents: 390519818, Store size: 1221.1gb (1311185373569 B)
Field cache evictions: 0, Field cache size: 7.3gb, Filter cache size:
6.2gb
Merges: Current: 0, Total: 283, Took: 6.6m
Those are coming from Lucene. I've always seen a lot of them with
jmap, even in very healthy situations.
You may want to review other JVM params that control various size
ratios or when GC kicks in, etc.
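For example, on HotSpot the generation-sizing and CMS-trigger knobs look like the following (the values here are purely illustrative, not recommendations for this cluster):

```
-XX:NewRatio=3                          # old gen : young gen size ratio
-XX:SurvivorRatio=8                     # eden : survivor space ratio
-XX:+UseConcMarkSweepGC                 # the CMS collector discussed in this thread
-XX:CMSInitiatingOccupancyFraction=75   # start CMS when old gen is 75% full
-XX:+UseCMSInitiatingOccupancyOnly      # don't let the JVM adjust that threshold
```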
Thanks Otis. But why aren't these instances being collected during
garbage collection? Do we know which mechanism in elasticsearch is
holding all these references? I had assumed it was the field data
cache or the filter cache, but the stats show these caches are very
small.
Our cluster died just now, and I think it is because of this issue with
old gen being filled up with references to Term or TermInfo objects.
17 of 20 nodes were at 99.999% old gen usage, yet these nodes had an
average CPU load of only 1.0.
The stack trace from one of the nodes shows search threads being
blocked by a FieldDataLoader object: http://pastie.org/3220393
I'm guessing FieldDataLoader was having problems because of garbage
collection hell; here are the GC stats before we had to restart
(concurrent mark sweep):
So the question remains: why is my old gen filled up with Term/
TermInfo instances and not being collected, and is that the cause of my
searches being blocked? The old gen is 24GB, yet my field data cache
never grew over 7GB according to BigDesk, and the filter cache defaults
to 20% of memory.
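As a back-of-the-envelope check, the two caches reported by BigDesk account for only about half of a 24GB old gen (treating the old gen as the whole budget is a simplification):

```python
# Rough heap accounting for one node, using the numbers from this thread.
heap_gb = 24.0           # old gen size mentioned above
field_data_gb = 7.3      # field cache size reported by BigDesk
filter_cache_gb = 6.2    # filter cache size reported by BigDesk
accounted = field_data_gb + filter_cache_gb
unaccounted = heap_gb - accounted
print(f"accounted for: {accounted:.1f} GB, unexplained: {unaccounted:.1f} GB")
```

So roughly 10GB is unexplained by the caches, which is what points the suspicion at something else holding references.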
I believe sorting and facets require loading a lot of terms into
memory; could that be it?
The TermInfo instances are from Lucene's terms index.
Every 128th term is held in memory... so if you have many terms, that
can become sizable.
However: as of Lucene 3.5.0 the RAM required is substantially reduced
(LUCENE-2205). The terms are written into a more compact in-memory
format instead of a TermInfo+Term+String per term.
If upgrading is not an option then you can also set the terms index
divisor (I'm not sure how to do so through Elasticsearch); e.g. setting
it to 2 loads every 256th term instead and uses half the RAM, but then
seeking to a given term will be slower.
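The scaling is easy to sketch. The total term count below is a made-up example, and the real pre-3.5 per-entry overhead (a TermInfo + Term + String per held term) varies with term length, so this only shows how the divisor changes the count of terms held in RAM:

```python
# Sketch: how the terms index divisor scales the number of in-memory terms.
def terms_in_ram(total_terms, interval=128, divisor=1):
    # Lucene keeps every (interval * divisor)-th term of the terms index in memory.
    return total_terms // (interval * divisor)

total = 2_000_000_000  # hypothetical unique term count across segments
for d in (1, 2, 4):
    print(f"divisor={d}: ~{terms_in_ram(total, divisor=d):,} terms held in RAM")
```

Each doubling of the divisor halves the resident term count, at the cost of longer scans when seeking to a term.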
You can change the term index divisor on a live cluster using the update
settings API, but it will make search slower. Which version of
elasticsearch are you using? Lucene 3.5.0 (as Mike noted), which is part
of 0.18.5 and above, uses much less memory for the terms index.
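A minimal sketch of calling the update settings API for this, assuming a stock setup; the setting name `index.term_index_divisor`, the host, and the index name are assumptions here, so check the docs for your version before using them:

```python
# Hypothetical sketch: raise the term index divisor on a live index via the
# update settings API (PUT /{index}/_settings).
import json
from urllib.request import Request, urlopen

def build_payload(divisor):
    # Settings body; "index.term_index_divisor" is an assumed setting name.
    return json.dumps({"index": {"term_index_divisor": divisor}}).encode()

def update_divisor(host, index, divisor):
    req = Request(f"http://{host}:9200/{index}/_settings",
                  data=build_payload(divisor), method="PUT")
    req.add_header("Content-Type", "application/json")
    return urlopen(req)  # returns the HTTP response on success
```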
Thanks for the replies... Here are some results:
We were definitely over-sharding, the JVM settings needed to be tuned,
and we weren't fully taking advantage of our disks. After moving from
weekly partitions (indices) to quarterly, we had copious amounts of
RAM to give back for file caching. We then set the number of shards and
replicas to distribute evenly across the nodes. The next issue was
query speed, and we realized that our RAID0 configuration wasn't
cutting it. Now all disks are working evenly and each node is
efficient. We're in a good place now (I think), but the next
optimization we are thinking about is to use routing so that each
shard holds a time interval of data. This would make searches with
smaller date ranges hit fewer machines, and it efficiently
puts more machines to work as the search date range grows larger.
Does this seem like a good use case for routing?
If you already create an index per time interval, then you can use that
to control what timespan you search on; I think using time interval for
routing within an index might not really be needed.
On Monday, January 30, 2012 at 6:22 PM, lukeforehand wrote:
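The index-per-interval approach can be sketched as picking which quarterly indices a date-range search actually needs, so narrow ranges hit fewer shards and machines; the `logs-YYYY-qN` naming scheme below is made up for illustration:

```python
# Sketch: select the quarterly indices that cover a date range.
from datetime import date

def quarter(d):
    # Map a date to its (year, quarter-number) pair.
    return (d.year, (d.month - 1) // 3 + 1)

def indices_for_range(start, end, prefix="logs"):
    names = []
    y, q = quarter(start)
    end_y, end_q = quarter(end)
    while (y, q) <= (end_y, end_q):
        names.append(f"{prefix}-{y}-q{q}")
        q += 1
        if q == 5:
            y, q = y + 1, 1
    return names
```

A search over November 2011 to February 2012 would then target only `logs-2011-q4` and `logs-2012-q1` instead of every index.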