Multiple cluster state copies in memory VS many aliases

(Oleksiy Kovyrin) #1

Hey guys,

I'm not sure if you could do anything about the issue, but still, I wanted
to make sure you are aware of it. So, here is the story:

I work for Due to the nature of our business, we implement user search
engines as separate or aggregated indexes in ES and use per-customer
aliases to route requests to the appropriate place. So we end up with
clusters that have ~1000 shards and could easily have 50-150k aliases. ES
holds up really well under load, and the only issue we have, which has been
the major pain point for us for the last 4 months, is the way ES handles
cluster state changes.

Every time we add/remove an alias or an index, ES generates gigabytes of
garbage on the Java heap. Adding more heap does not really help, because
then those piles of garbage start causing very long old gen pauses, and that
is really painful for us: our clusters are constantly under load, and
having them stop for seconds to collect garbage is unacceptable.

For months we've been building all kinds of bandages to mitigate the
effects of this issue (like creating an external per-cluster locking system
to avoid concurrent cluster state changes and other crazy stuff like that)
and trying to figure out why it was happening, without much luck.
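To illustrate the kind of serialization described above, here is a minimal sketch. It is not the actual locking system mentioned in this post: an in-process `threading.Lock` per cluster stands in for the external lock service, and `send` is a hypothetical callable that would POST the body to the cluster. The only Elasticsearch detail relied on is the `_aliases` endpoint and its `actions` list of `add`/`remove` entries. Grouping several alias actions into one request presumably also means fewer separate cluster state changes.

```python
import json
import threading
from collections import defaultdict

# Hypothetical stand-in for an external lock service: one lock per cluster name.
_cluster_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


def cluster_lock(cluster_name):
    """Return the lock for a cluster, creating it on first use."""
    with _registry_guard:
        return _cluster_locks[cluster_name]


def build_alias_actions(changes):
    """Turn (op, index, alias) tuples into one _aliases request body."""
    actions = []
    for op, index, alias in changes:
        if op not in ("add", "remove"):
            raise ValueError("unsupported alias operation: %r" % (op,))
        actions.append({op: {"index": index, "alias": alias}})
    return {"actions": actions}


def apply_alias_changes(cluster_name, changes, send):
    """Send all changes as a single request while holding the cluster's lock.

    `send(path, body)` is an assumed callable supplied by the caller.
    """
    body = json.dumps(build_alias_actions(changes))
    with cluster_lock(cluster_name):
        return send("/_aliases", body)
```

A caller would pass its own HTTP function as `send`; only one thread at a time can be inside `send` for a given cluster name.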

Yesterday we added a master-only (no data) node to one of our small
clusters and gave it 4G of heap (with new ratio = 3). Even during the
process of joining the cluster it went into GC for 5-10 sec multiple times,
every time failing to join because of timeouts. After many retries it
finally joined, and old gen stayed steady at ~1.2GB (out of the 3GB we gave
it) for hours and hours. But then, when some users started
creating/changing their indexes too frequently (once every few seconds),
the server would just go crazy and end up in a Full GC thrashing mode.
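For readers wondering where the 3GB old gen figure comes from: a new ratio of 3 means the old generation is three times the size of the young generation. The sketch below just does that arithmetic. It assumes the settings correspond to the standard HotSpot options `-Xmx4g -XX:NewRatio=3`, which the post does not spell out.

```python
def generation_sizes_mb(heap_mb, new_ratio):
    """Split a heap into (young, old) sizes for a given old:young ratio."""
    young = heap_mb // (new_ratio + 1)  # young gen is 1 part out of ratio + 1
    old = heap_mb - young               # old gen gets the remaining parts
    return young, old


# Assumed settings: 4G heap, new ratio = 3.
young_mb, old_mb = generation_sizes_mb(4096, 3)
print("young gen: %d MB, old gen: %d MB" % (young_mb, old_mb))
```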

I took a Java heap dump from this server today, and since there is no
Lucene crap in it (as there usually is on larger instances with data), it
was REALLY obvious where the problem is: in my dump I see that our cluster
MetaData object size is ~681MB and we have 6 (!) copies of it live in
memory:

  • one copy in current cluster metadata
  • one copy in a running InternalClusterService$UpdateTask
  • and 4 copies in InternalClusterService$UpdateTask instances in the
    blocking queue for update service.
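As a rough back-of-the-envelope check of the numbers above, the sketch below multiplies the copy count by the reported MetaData size. It assumes the six copies share no structure, which may not hold (heap dump tools can report overlapping retained sizes), so treat the result as a naive upper bound rather than a measurement.

```python
METADATA_MB = 681   # reported MetaData object size
OLD_GEN_MB = 3072   # the 3GB old gen mentioned earlier, taken as 3 * 1024 MB

# Copy counts as listed in the heap dump breakdown above.
copies = {
    "current cluster metadata": 1,
    "running UpdateTask": 1,
    "queued UpdateTask instances": 4,
}

total_copies = sum(copies.values())
upper_bound_mb = total_copies * METADATA_MB  # assumes no sharing between copies

print("%d copies, up to %d MB, old gen is %d MB"
      % (total_copies, upper_bound_mb, OLD_GEN_MB))
```

Under that assumption the copies alone would not fit in the old generation, which is at least consistent with the Full GC behaviour described earlier.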

I understand that we could add many more gigabytes of RAM and that would
help us to handle such "surges" of cluster state updates, but I'm afraid
that would just cost us a lot in GC pauses and would not really solve the
issues we're having on data nodes anyways.

So, I was wondering if you have any suggestions on how we can mitigate this
issue in the short term (aside from not creating so many aliases, which
is not really possible for us at the moment) and if you have any ideas
of how it could be solved in the server code in the long run.

I'd really appreciate any help!

P.S. We're on 0.90.5. If there are any fixes around cluster state handling
in the newer versions, please let me know. We're going to be upgrading our
clusters very soon, and if there is a chance a new release would help with
GC issues, I'd bump up the priority of the upgrade task in our task list.

Oleksiy Kovyrin
Head of Technical Operations

