First, lets get the single thread thingy out of the way, since it has to be a single thread. There is no option to really have multiple threads updating the same cluster state, in a similar manner that you can't have multiple nodes updating it (thats why we have one master in the cluster). Even if you had several threads updating it, you would basically need to sync the whole update and republish process, which effectively means that a single thread process it at a time.
Trying to allocate unassigned shards has to happen when a shard is started. You can't really tell which shards need/can be allocated next, so you need to run it in order to find out once a shard has started.
In the other thread I explained that one improvement that we can make, which is more relevant to your specific usecase where indices are small, is to batch the started shards in a time windows, and then run rerouting with the batched started shards, instead of running it immediately when a shard is started.
I need to check why reroute takes so long, thats strange since its mainly in memory operations (once the initial caching of state from other nodes has been retrieved). There are places that I know can be improved for very large number of shards (one of them is the batching mentioned above, another is to try and be smart as to what to publish in terms of cluster state).
Btw, you end up with a lot of shards per node, each a Lucene index, can they handle the load in this case?
On Saturday, June 4, 2011 at 8:53 PM, Engagor wrote:
Pretty sure my issues were related with the problem on
updateTaskExecutor. Since during the import of my existing dataset, a
lot of shard started events were sent meaning that the queueing became
an issue and as such OOM.
After further investigating this issue, I've found something in the
code that seems a bit fishy and in my case seems the main cause of the
slow recovery (and possibly the OOMs). The problem lies in the master
receiving a "shard started event":
- all these tasks execute in single thread on the master:
The last method will eventually try to initialize all unassigned
shards from that moment where only the just started one has been
updated (single threaded)
Consider the following example:
- 1000 indices (each 1 shard) on 4 nodes
After full cluster restart:
- master will init node_initial_primaries_recoveries shards on every
- on each node, these shards will be recovered/started and each
dispatch a "started" event
- suppose that node_initial_primaries_recoveries = 10
- First run of allocateUnassigned will init 40 shards accross 4
- all of these send a start event
- the first start event does all the things on a cluster with 0
- when done, 1 shard is active and an additional 39 shard starts
have been send out (while in the meanwhile we most likely already
received events from others, but single threaded, so we do not know)
- in the meanwhile it is quite possible that most of the 40 nodes
have already recovered and as such also sent out the started events
In my case, running the reroute method takes approx 2s, which is long,
especially since I have 14000 shards. This means that it would take
approx 8 hours to finish. Rememeber that in this case no real data has
actually been recovered, only changes to the in memory routing table.
On Jun 4, 6:39 pm, Shay Banon <shay.ba...@elasticsearch.com (http://elasticsearch.com)> wrote:
Is that still the case based on the other thread you posted?
You can check the memory usage using the node stats API, and see if teh filter cache or field cache are taking any memory. Filter cache can be completely disabled by setting: index.cache.filter.type to
none. But, the new
node level filter cache is actually quite good for cases of multi tenancy, see more here:http://www.elasticsearch.org/guide/reference/index-modules/cache.html.
Field cache is something quite expensive to load, so you would want it loaded all the time. Same goes for a design of closing searches (which you can't do with elasticsearch) and not keeping them open. You can close indices and open them on demand though.
On Saturday, June 4, 2011 at 11:46 AM, Engagor wrote:
I'm one of the people behindhttp://engagor.comand we are currently
using Solr for as search engine.
Our application is probably an atypical one since we have lots of
writes and few reads compared to the typical usage where apps have a
lot of reads and few writes.
Our Solr setup can currently handle per node:
- roughly 400M documents (documents are very small and no data is
- the documents are split accross roughly 4000 cores
The default Solr code is not capable of starting a server with the
above content. If you would try this, then memory usage would increase
to more than 12GB and startup time would take multiple minutes.
In order to allow the above setup, we've made the following
adjustments to Solr:
- use the same schema and config for all cores
- lazy loading of all cores / analyzers / handlers
- at any given time, only a preconfigured number of Searchers is
active (in our case 10). Searchers get dynamically loaded, unloaded
- by default, after a write, Solr will preload a new searcher for
that core. In our setup, after a write, Solr simply closes existing
searchers for that core and loads new ones dynamically
- term indexes mostly used for facetting are also cached with LRU.
At any given time, we have at most a preconfigured number of them in
With the adjustments above, our solr setup starts within a few seconds
and at most times keeps memory usage below 512M while still being very
fast (still fast even without caching since the cores contain 10k
documents on average).
Due to the better cluster support and more advanced out of the box
searching capabilities, we recently decided to switch to
ElasticSearch. It would be great if we could implement a similar setup
memory usage wise to the above with ES.
The ElasticSearch setup consists of a cluster of 3 nodes (where we
used to have one) with replicate=1.
While indexing our existing dataset (so no searching), we already
quickly hit OutOfMemory errors.
Are all settings realted to caching in
Where can I find a description of the different types of caches being
used? resident, soft and weak?
I assume the reason we hit OOM errors is due to the Field Data Cache?
Setting the type to soft and having additionally
index.cache.field.expire set would probably fix the memory issues in
Another question I have is about the Filter Cache. Since we have only
few searches, does it make sense in our setup to simply not using
filters and always including the extra filtering in our main query?