OutOfMemoryError on geo fields (geo_distance query)

Hi folks,

We have an index of relatively small documents (around 2-5K per document)
with a count of around 4.5 million docs. Around 40% of the docs have a
"location" field which is a geo_point.

When we were at around 1.3 million docs I could execute a query with a
geo_distance filter with no problem (around 500ms response time); now at
4.5 million docs I get this:

loading field [location] caused out of memory failure
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.trove.list.array.TDoubleArrayList.ensureCapacity(TDoubleArrayList.java:186)
    at org.elasticsearch.common.trove.list.array.TDoubleArrayList.add(TDoubleArrayList.java:221)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldData$StringTypeLoader.collectTerm(GeoPointFieldData.java:187)
    at org.elasticsearch.index.field.data.support.FieldDataLoader.load(FieldDataLoader.java:59)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldData.load(GeoPointFieldData.java:168)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldDataType.load(GeoPointFieldDataType.java:55)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldDataType.load(GeoPointFieldDataType.java:34)
    at org.elasticsearch.index.field.data.FieldData.load(FieldData.java:111)
    at org.elasticsearch.index.cache.field.data.support.AbstractConcurrentMapFieldDataCache.cache(AbstractConcurrentMapFieldDataCache.java:130)
    at org.elasticsearch.index.search.geo.GeoDistanceFilter.getDocIdSet(GeoDistanceFilter.java:115)

I've increased the ES_MAX_MEM value in the startup script to 4GB:

ES_MAX_MEM=4g

Is there some ratio of geo_point count to RAM that I need to be aware of?
I am just running the default (out of the box) setup for ES:

number_of_nodes: 1
number_of_data_nodes: 1
active_primary_shards: 5
active_shards: 5

--

The geo_distance filter needs all geo point field values (lat and lon, as
two double values) to be loaded into memory for fast filtering / distance
calculation. So the ratio is 1: everything in RAM. From what I understand
you have multiple geo points per document, around 2K to 5K, right? This
can make the field data cache entries (which the geo_distance filter uses)
very large.

You can also see how big the field data cache is for each node in your
cluster. You can use the nodes stats API for this.

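For example, something along these lines (I'm assuming the 0.19/0.20-era
endpoint and flags here, so adjust the path for your version; the indices
flag includes the cache sizes and the jvm flag includes heap usage):

# sketch: fetch per-node cache and heap stats, pretty-printed
curl -XGET 'http://localhost:9200/_cluster/nodes/stats?indices=true&jvm=true&pretty=true'
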
I think this will give you better insight, and based on it you might
decide to increase the heap size even further. If you use the jvm flag you
can also see the used heap space (be aware that this also includes memory
that is waiting to be garbage collected).

Btw, I recommend setting ES_HEAP_SIZE instead of ES_MAX_MEM. Also, are you
sure that the process isn't swapping? Swapping can result in very bad
performance. If it is, use the bootstrap.mlockall option to prevent it.

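As a concrete sketch of those two suggestions (exact file locations depend
on how you installed ES):

# environment for whatever starts Elasticsearch
# (ES_HEAP_SIZE sets -Xms and -Xmx to the same value)
export ES_HEAP_SIZE=4g

# config/elasticsearch.yml, to keep the heap from being swapped out
bootstrap.mlockall: true

If you enable mlockall you may also need to raise the memlock ulimit for
the user running Elasticsearch; I believe ES logs a warning at startup if
it cannot lock the memory.
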
Martijn


--

Hi Martijn,

Thanks for your response. I checked the node stats as you suggested and it
looks like there may be a problem with the JVM's heap allocation.

If I try to execute a query with a geo_distance filter the process fails
with an OutOfMemoryError, so I can't get a good read on exactly how much
it needs, but I was able to see how much it was using at the time of
failure.

The field cache was:

field_size: "943.3mb"

While the JVM heap was:

heap_used: "1001.6mb"

This heap figure is suspiciously close to 1GB, which tells me that either
the ES_MAX_MEM setting is not sticking or the machine just won't give up
any more RAM (I'm not sure whether paging is enabled on the box). Either
way this explains the OutOfMemoryError. The simplest immediate solution is
just to get a larger box, but that is not a solution for us in the longer
term.

We currently have around 4.5 million records. Around 40% of those have at
least one location, with around 10% (of the total) having more than one.
However, we have only processed a fraction of the raw data that we have,
and we expect to end up with around 400 million records. If I need a whole
server (node) for just 4.5 million then I'll need somewhere between 50 and
100 nodes to be able to deal with 400 million. This is just not viable for
us and I'm confident that without geo searches Lucene (and Elasticsearch)
could handle several hundred million records without too many problems on
just a couple of servers.

Is there any way to perform a geo_distance query that does not require so
much memory? We have discussed implementing our own solution by indexing a
"quad tree" for each document, limiting results with a simple bounding box
and then doing a final in-memory filter of the smaller result set. This
would use considerably less memory, and although it may not be as fast, at
least it would not mean we needed hundreds of servers. But I feel like
this is not something we want to build ourselves.

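For what it's worth, the closest I can see to that idea with the existing
query DSL would be to combine a geo_bounding_box pre-filter with the exact
geo_distance check, something like the sketch below (the box corners are
made up, and I honestly don't know whether the bounding box filter avoids
loading the same field data on our version, so this may not help memory at
all):

{
  "query": { "match_all": {} },
  "filter": {
    "and": [
      {
        "geo_bounding_box": {
          "location": {
            "top_left": { "lat": 40.81, "lon": -74.32 },
            "bottom_right": { "lat": 40.79, "lon": -74.29 }
          }
        }
      },
      {
        "geo_distance": {
          "distance": 0.3728226,
          "location": {
            "lat": 40.797307367399654,
            "lon": -74.30757522583008
          }
        }
      }
    ]
  }
}
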
If I said something like:

"We have 100 million documents, each with at least one location and some
with more than one and we want to perform geo distance style queries".
Would you say that Elasticsearch is a good solution for this?

Thanks for your help.

Jason.


--

> The field cache was:
>
> field_size: "943.3mb"
>
> While the JVM heap was:
>
> heap_used: "1001.6mb"
>
> This heap figure is suspiciously close to 1GB, which tells me that either
> the ES_MAX_MEM setting is not sticking or the machine just won't give up
> any more RAM (I'm not sure whether paging is enabled on the box). Either
> way this explains the OutOfMemoryError. The simplest immediate solution
> is just to get a larger box, but that is not a solution for us in the
> longer term.
ES_MAX_MEM doesn't set the JVM's -Xms (initial heap size); ES_HEAP_SIZE
sets both the -Xms and -Xmx (maximum heap size) options. If only
ES_MAX_MEM is set to 4GB, the JVM grows the heap as the existing heap
fills up, so initially you will see a used heap of around 1GB. This is
why I recommended using ES_HEAP_SIZE instead.

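Roughly, going from memory of the stock elasticsearch.in.sh defaults (so
double-check your startup script):

# ES_MAX_MEM only raises the ceiling; the heap starts at the small
# ES_MIN_MEM default and grows on demand:
ES_MAX_MEM=4g      ->  java ... -Xms256m -Xmx4g

# ES_HEAP_SIZE pins min and max together, so the full heap is committed
# up front:
ES_HEAP_SIZE=4g    ->  java ... -Xms4g -Xmx4g
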
How are you currently defining the geo_distance filter in the query DSL?
The performance of the geo_distance filter can vary considerably based on
where it is placed in the request.

> If I said something like: "We have 100 million documents, each with at
> least one location and some with more than one, and we want to perform
> geo distance style queries", would you say that Elasticsearch is a good
> solution for this?
If you're more interested in geo filtering (not based on distance, but on
whether results fall inside a specific area), then I suggest you look into
the geo_shape filter (see the geo_shape filter page in the reference
documentation).

Btw what ES and Java version are you using at the moment?

Martijn

--

The geo_distance filter is specified as part of a "filter" element like so:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "foo": "bar"
              }
            }
          ]
        }
      },
      "filter": [
        {
          "geo_distance": {
            "distance": 0.3728226,
            "location": {
              "lat": 40.797307367399654,
              "lon": -74.30757522583008
            }
          }
        }
      ]
    }
  }
}

I'll check out the geo shape filter, thanks.

We're running on an Ubuntu EC2 server, so the latest Java version available
is 1.6.0_22 (OpenJDK):

java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.1) (6b22-1.10.1-0ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

I'm not as concerned about query performance as I am about memory usage.
I'm OK if a query takes several seconds; I just don't want to have to run
loads of servers to do it.

Since the last email I've upgraded the EC2 server to a high-memory instance
(17GB RAM) and set the max heap size to 10G. Here are the current usage
stats:

cache:
  field_evictions: 0
  field_size: "1.7gb"
  field_size_in_bytes: 1918358410
  filter_count: 0
  filter_evictions: 0
  filter_size: "0b"
  filter_size_in_bytes: 0
  bloom_size: "7mb"
  bloom_size_in_bytes: 7369128
  id_cache_size: "0b"
  id_cache_size_in_bytes: 0

jvm:
  heap_used: "5.6gb"
  heap_committed: "9.9gb"
  non_heap_used: "42mb"
  non_heap_committed: "67.2mb"

As you can see, the field cache size is 1.7GB. At this rate we'd need 10 of
these servers to host 10x the number of records under the current schema,
which would cost around 6K per month on AWS; that is just not tenable. (I'm
assuming heap_used is high simply because Java hasn't needed to GC yet.)

If we switch to a geo_shape, would that mean we don't need as much memory?


--

> The geo_distance filter is specified as part of a "filter" element like
> so: [query snipped]
If you're not using facets it is beneficial to put the geo_distance filter
as a top level filter in your search request.

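For example, taking your query as a starting point (just a sketch,
untested):

{
  "query": {
    "bool": {
      "must": [
        { "term": { "foo": "bar" } }
      ]
    }
  },
  "filter": {
    "geo_distance": {
      "distance": 0.3728226,
      "location": {
        "lat": 40.797307367399654,
        "lon": -74.30757522583008
      }
    }
  }
}
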
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.1) (6b22-1.10.1-0ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
That is actually a very old Java version. I recommend upgrading to the
latest Java 7 release (update 10) or at least the latest Java 6 release
(update 38). This will improve the performance and stability of your
cluster.

> As you can see, the field cache size is 1.7GB. At this rate we'd need 10
> of these servers to host 10x the number of records under the current
> schema, which would cost around 6K per month on AWS; that is just not
> tenable. (I'm assuming heap_used is high simply because Java hasn't
> needed to GC yet.)
The geo_distance filter relies on all geo points being loaded into memory;
there is no way around that at the moment. heap_used is higher because it
also includes unreferenced objects; after a full GC the heap_used should
be much lower.

> If we switch to a geo_shape, would that mean we don't need as much
> memory?
Yes. The geo_shape filter doesn't rely on the fielddata cache.

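To give you an idea, the mapping change would look roughly like this
("your_type" is just a placeholder, and you should check the geo_shape
type docs for your version since the options have changed between
releases):

{
  "your_type": {
    "properties": {
      "location": {
        "type": "geo_shape"
      }
    }
  }
}

Each document would then index its location as a shape instead of a
geo_point, with coordinates in [lon, lat] order, e.g.:

{
  "location": {
    "type": "point",
    "coordinates": [ -74.30757522583008, 40.797307367399654 ]
  }
}
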
Martijn

--

Hi Martijn,

Yes, we are using facets, but I'll tinker with the positioning of the
geo_distance filter to see if it makes a difference.

Regarding the Java version, this is just the default package that Ubuntu
"recommends" installing (per the Ubuntu community help wiki page on Java),
but I take your point and will look at updating.

We may switch to geo_shape if it doesn't require as much memory; we'll
just have to alter our UI a bit (we're currently using circles overlaid on
Google Maps, so we'll have to switch to polygons).

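(I'm guessing the query side would then end up looking something like the
sketch below, with each circle approximated by a closed ring of [lon, lat]
points; the coordinates here are made up and I haven't verified the exact
geo_shape filter syntax for our ES version yet.)

{
  "query": { "match_all": {} },
  "filter": {
    "geo_shape": {
      "location": {
        "shape": {
          "type": "polygon",
          "coordinates": [[
            [ -74.312, 40.801 ],
            [ -74.303, 40.801 ],
            [ -74.303, 40.794 ],
            [ -74.312, 40.794 ],
            [ -74.312, 40.801 ]
          ]]
        }
      }
    }
  }
}
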
Thanks for your great feedback. I really appreciate you taking the time.

Cheers,

Jason.


--