I've just tried loading a million documents into Elasticsearch, running
on a small dev server, and memory usage grew until there was none left,
at which point it refused to accept any more docs.
I switched to using the file system rather than memory, and everything
worked nicely (though a bit slower, obviously).
However, I have another 4 million docs to load, which will take up a LOT
of memory.
Does sharding mean that if the memory usage of a single-node cluster is
4GB, then the memory usage on each node in a 4-node cluster will be
roughly 1GB?
thanks
clint
> I've just tried loading a million documents into Elasticsearch, running
> on a small dev server, and memory usage grew until there was none left,
> at which point it refused to accept any more docs.
> I switched to using the file system rather than memory, and everything
> worked nicely (though a bit slower, obviously).
Did you use the native memory one (i.e. in 0.5 and above, type: memory in
the store with no other argument)? In theory, you are then bounded by the
physical memory, and not by how much memory you allocate to the JVM. In this
case, by the way, I suggest using a large bufferSize.
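For concreteness, a minimal sketch of the setting being discussed; the type: memory value is what is described above, but treat the exact keys as illustrative for this 0.x era and check the store documentation for your version:

```python
# Sketch only (0.x-era Elasticsearch): enabling the native memory store for
# an index. "type: memory" is what is described above; leaving it unset
# falls back to the default file-system store, which is what the original
# poster switched to.
memory_store_settings = {
    "index.store.type": "memory",  # native/direct memory store, no other arguments
}
```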
> However, I have another 4 million docs to load, which will take up a LOT
> of memory.
> Does sharding mean that if the memory usage of a single-node cluster is
> 4GB, then the memory usage on each node in a 4-node cluster will be
> roughly 1GB?
Yep, that's the idea. Don't forget the replicas, though. If you have 5 shards
with 1 replica each, then one node will take 4GB, and two nodes will each take
4GB (because of the replicas).
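To make that concrete, here is a small sketch of the arithmetic, under the assumption that shards are roughly equal in size and end up evenly balanced across nodes:

```python
# Rough capacity math for 5 primary shards, 1 replica each, ~4GB of data.
# Assumes equally sized shards and even balancing across nodes.
primaries = 5
replicas_per_primary = 1
data_gb = 4.0
gb_per_shard = data_gb / primaries  # ~0.8GB per shard copy

for nodes in (1, 2, 4):
    if nodes == 1:
        # A replica is never allocated on the same node as its primary,
        # so a single node holds only the 5 primaries.
        copies_in_cluster = primaries
    else:
        copies_in_cluster = primaries * (replicas_per_primary + 1)
    per_node_gb = copies_in_cluster / nodes * gb_per_shard
    print(f"{nodes} node(s): ~{per_node_gb:.1f}GB per node")

# 1 node:  ~4.0GB            (primaries only)
# 2 nodes: ~4.0GB per node   (primaries plus replicas, as above)
# 4 nodes: ~2.0GB per node   (replicas double the ~1GB estimate from the
#                             original question)
```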
> Did you use the native memory one (i.e. in 0.5 and above, type: memory
> in the store with no other argument)?
yes
> In theory, you are then bounded by the physical memory, and not by how
> much memory you allocate to the JVM.
Yes - it was a small machine with only 2GB of memory. That said, 700,000
objects were using 1.4GB, and I currently need to index 5 million objects,
which will be a lot of memory.
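A back-of-the-envelope extrapolation from those numbers, assuming memory grows roughly linearly with the number of objects:

```python
# Back-of-the-envelope estimate from the figures above; assumes memory
# grows roughly linearly with document count.
observed_docs = 700_000
observed_gb = 1.4
gb_per_doc = observed_gb / observed_docs  # roughly 2KB per object

target_docs = 5_000_000
estimated_gb = target_docs * gb_per_doc
print(f"~{estimated_gb:.0f}GB estimated for {target_docs:,} objects")  # ~10GB
```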
> In this case, by the way, I suggest using a large bufferSize.
You mean when using index.storage.type = 'memory'? Why a large
bufferSize? And how big is considered large?
> Yep, that's the idea. Don't forget the replicas, though. If you have 5
> shards with 1 replica each, then one node will take 4GB, and two nodes
> will each take 4GB (because of the replicas).
OK, to understand this:
I have (e.g.) 5GB of data when running with one node, which has 5 shards.
If I start 5 nodes, with 5 shards and 2 replicas, then I would have:
2 nodes using 5GB
3 nodes using 1GB
Is this correct?
Still trying to get my head around how this all works
Answers below. But before that, let me point you to elasticsearch's multi-index
support. It basically means that you can have one index in memory, which is
smaller, and another index that is stored on the file system. For example, if
you can break them up based on types, it might make sense. Remember that you
can search across several indices with elasticsearch.
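A minimal sketch of that split, with one memory-backed index for the smaller, hotter data and one default file-system-backed index for the rest, then a single search across both. The index names are made up, and the create-index body layout and store setting key are assumptions (they changed across early releases), so treat this purely as an illustration:

```python
# Illustrative only: index names are invented, and the settings body layout
# and store setting key may differ in old 0.x releases.
import requests

ES = "http://localhost:9200"

# Smaller, hot data kept in a memory-backed index.
requests.put(f"{ES}/hot_docs",
             json={"settings": {"index.store.type": "memory"}})

# Larger, bulk data left on the default file-system store.
requests.put(f"{ES}/bulk_docs")

# One search across both indices (comma-separated index list).
resp = requests.post(f"{ES}/hot_docs,bulk_docs/_search",
                     json={"query": {"match_all": {}}})
print(resp.json())
```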
>> Did you use the native memory one (i.e. in 0.5 and above, type: memory
>> in the store with no other argument)?
> yes
>> In theory, you are then bounded by the physical memory, and not by how
>> much memory you allocate to the JVM.
> Yes - it was a small machine with only 2GB of memory. That said, 700,000
> objects were using 1.4GB, and I currently need to index 5 million objects,
> which will be a lot of memory.
So it means that if you have 5 simple machines with 4GB each, you would have
20GB of memory :).
>> In this case, by the way, I suggest using a large bufferSize.
> You mean when using index.storage.type = 'memory'? Why a large
> bufferSize? And how big is considered large?
I would say a bufferSize of 100k - 200k is a good value.
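In index-settings form, that might look something like the following. The buffer-size key name here is a guess for illustration only (it varies by version), so check the memory-store docs for the exact name:

```python
# The buffer_size key below is hypothetical; check the memory-store docs
# for the exact setting name in your 0.x release.
settings = {
    "index.store.type": "memory",
    "index.store.memory.buffer_size": "100k",  # in the suggested 100k-200k range
}
```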
>> Yep, that's the idea. Don't forget the replicas, though. If you have 5
>> shards with 1 replica each, then one node will take 4GB, and two nodes
>> will each take 4GB (because of the replicas).
> OK, to understand this:
> I have (e.g.) 5GB of data when running with one node, which has 5 shards.
> If I start 5 nodes, with 5 shards and 2 replicas, then I would have:
> 2 nodes using 5GB
> 3 nodes using 1GB
> Is this correct?
Let me simplify the math. If you have 5 shards, each with 2 replicas, then in
total you have 5 * (2 + 1) = 15 shard instances running (the primary shards
plus their replicas).
If one node with 5 shards holds 5GB, let's assume 1GB per shard. That means
the 15 shard instances would need 15GB in total.
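Put as a quick calculation (again assuming equally sized shards and even balancing across the 5 nodes):

```python
# The math above: 5 primaries, 2 replicas each, ~1GB per shard copy.
# Assumes equally sized shards and even balancing across nodes.
primaries = 5
replicas_per_primary = 2
gb_per_shard = 1.0  # 5GB of data spread over 5 primary shards

shard_copies = primaries * (replicas_per_primary + 1)  # 15 shard instances
total_gb = shard_copies * gb_per_shard                 # 15GB in total

nodes = 5
per_node_gb = total_gb / nodes  # ~3GB per node, rather than a 5GB/1GB split
print(shard_copies, total_gb, per_node_gb)  # 15 15.0 3.0
```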