Hi Martijn,
Thanks for your response. I checked the node stats as you suggested and it
looks like there may be a problem with the JVM's heap allocation.
If I try to execute a query with a geo_distance filter, the process fails
with an OutOfMemoryError, so I can't get a good read on exactly how much
memory it needs, but I was able to see how much it had used at the time of
failure.
The field cache was:
field_size: "943.3mb"
While the JVM heap was:
heap_used: "1001.6mb"
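For reference, I pulled those numbers with something along these lines (I
believe the exact URL and flags vary a little between versions):

curl -XGET 'http://localhost:9200/_cluster/nodes/stats?jvm=true&pretty=true'

# indices.cache.field_size -> the field data cache ("943.3mb" above)
# jvm.mem.heap_used        -> current heap usage ("1001.6mb" above)
# jvm.mem.heap_committed   -> what the JVM has actually been given, which I
#                             assume should show whether the 4g setting sticks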
This heap figure is suspiciously close to 1GB, which tells me that either
the ES_MAX_MEM setting is not sticking, or the machine just won't give up
any more RAM (I'm not sure whether paging is enabled on the box or not).
Either way, this explains the OutOfMemoryError, and the simplest immediate
fix is just to get a larger box, but that is not a solution for us in the
longer term.
We currently have around 4.5 million records. Around 40% of those have at
least one location, with around 10% (of the total) having more than one.
However, we have only processed a fraction of the raw data that we have, and
we expect to end up with around 400 million records. If I need a whole
server (node) for just 4.5 million then I'll need somewhere between 50 and
100 nodes to be able to deal with 400 million. This is just not viable for
us and I'm confident that without geo searches Lucene (and Elasticsearch)
could handle several hundred million records without too many problems on
just a couple of servers.
Is there any way to perform a geo_distance query that does not require so
much memory? We have discussed implementing our own solution by simply
indexing a "quad tree" for each document thereby limiting results with a
simple bounding box, then doing a final filter in memory of the smaller
result set. This would use considerably less memory and although it may
not be as fast at least it would not mean we needed hundreds of servers.
But I feel like this is not something we want to build ourselves.
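To make that concrete, what we have in mind is roughly the following (the
"location_cell" field name, cell values and coordinates are all made up; the
cell key would be a fixed-depth quadtree/geohash prefix we compute ourselves
at index time). Each document would be indexed with its cell alongside the
point:

{
  "location": { "lat": 40.7143, "lon": -74.006 },
  "location_cell": "0231103321"
}

At query time we would expand the search radius into the set of covering
cells, filter on them with a plain terms filter, and then do the exact
distance check in our own code over the much smaller candidate set:

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "terms": { "location_cell": ["0231103321", "0231103322", "0231103330"] }
      }
    }
  }
}

As far as I understand, a terms filter like this only touches the inverted
index, so nothing per-document needs to be loaded onto the heap.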
If I said something like:
"We have 100 million documents, each with at least one location and some
with more than one, and we want to perform geo_distance-style queries".
Would you say that Elasticsearch is a good solution for this?
Thanks for your help.
Jason.
On Tue, Jan 8, 2013 at 5:50 AM, Martijn v Groningen <
martijn.v.groningen@gmail.com> wrote:
The geo_distance filter needs all geo point field values (lat and lon, as
two double values) to be loaded into memory for fast filtering / distance
calculation. So the ratio is 1: everything is in RAM. From what I understand,
you have multiple geo points per document, around 2k to 5k, right? This can
make the field data cache entries (which the geo_distance filter uses) very
large.
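As a very rough lower bound for your current index (this ignores the
per-document lookup structures the cache also keeps, so the real entries are
considerably larger):

~4.5m docs x ~40% with at least one point  ->  ~1.8m points (more for multi-valued docs)
1 point = lat + lon as doubles             ->  2 x 8 bytes = 16 bytes
~1.8m points x 16 bytes                    ->  roughly 30mb for the raw coordinates alone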
You can also see how big the field data cache is for each node in your
cluster. You can use the node stats api for this (see the node stats page in
the Elasticsearch reference documentation).
I think this will give you better insight, and based on it you might
decide to increase the heap space size even further. If you use the
jvm flag you can also see the used heap space (be aware that this also
includes memory that is yet to be garbage collected).
Btw, I recommend setting ES_HEAP_SIZE instead of ES_MAX_MEM.
Also, are you sure that the process isn't swapping? That can result in
bad performance. If it is, use the bootstrap.mlockall option to prevent
it.
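For example, something along these lines (the 4g value is just illustrative,
and the OS memlock limit for the user may also need to be raised for
mlockall to take effect):

# environment, picked up by the startup script (sets both -Xms and -Xmx)
export ES_HEAP_SIZE=4g

# config/elasticsearch.yml
bootstrap.mlockall: true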
Martijn
On 7 January 2013 23:01, Jason jason.polites@gmail.com wrote:
Hi folks,
We have an index of relatively small documents (around 2-5K per document)
with a count of around 4.5 Million docs. Around 40% of the docs have a
"location" field which is a geo_point.
When we were at around 1.3 million docs I was able to execute a query with a
geo_distance filter with no problem (around 500ms response time); now at 4.5
million docs I get this:
loading field [location] caused out of memory failure
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.trove.list.array.TDoubleArrayList.ensureCapacity(TDoubleArrayList.java:186)
    at org.elasticsearch.common.trove.list.array.TDoubleArrayList.add(TDoubleArrayList.java:221)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldData$StringTypeLoader.collectTerm(GeoPointFieldData.java:187)
    at org.elasticsearch.index.field.data.support.FieldDataLoader.load(FieldDataLoader.java:59)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldData.load(GeoPointFieldData.java:168)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldDataType.load(GeoPointFieldDataType.java:55)
    at org.elasticsearch.index.mapper.geo.GeoPointFieldDataType.load(GeoPointFieldDataType.java:34)
    at org.elasticsearch.index.field.data.FieldData.load(FieldData.java:111)
    at org.elasticsearch.index.cache.field.data.support.AbstractConcurrentMapFieldDataCache.cache(AbstractConcurrentMapFieldDataCache.java:130)
    at org.elasticsearch.index.search.geo.GeoDistanceFilter.getDocIdSet(GeoDistanceFilter.java:115)
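For reference, the query is essentially a filtered match_all with a
geo_distance filter, roughly like this (index name, distance and coordinates
are illustrative only):

curl -XGET 'http://localhost:9200/myindex/_search' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "geo_distance": {
          "distance": "10km",
          "location": { "lat": 40.7143, "lon": -74.006 }
        }
      }
    }
  }
}'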
I've increased the ES_MAX_MEM value in the startup script to 4GB:
ES_MAX_MEM=4g
Is there some ratio of geo_point count to RAM that I need to be aware of? I
am just running the default (out-of-the-box) setup for ES:
number_of_nodes: 1
number_of_data_nodes: 1
active_primary_shards: 5
active_shards: 5
--
Kind regards,
Martijn van Groningen