So an Elasticsearch cluster I help run had an interesting issue last week
around mappings, and I wanted to get the community's thoughts on how to
prevent it from happening again.
Our cluster one morning went into utter chaos for no apparent reason. We
had nodes dropping constantly (both master and data nodes) and lots of
network exceptions in our log files. The cluster kept going red from all
the dropped nodes and was totally unresponsive to external requests.
Our cluster is fairly open to our users, meaning they can index whatever
they want without needing approval (this may have to change based on what
happened). The content stored is usually generated from .Net objects and
serialized using the Newtonsoft json serializer.
After six hours of investigation while trying to get our cluster stable,
this is what we found:
We had a new document type (around 30,000 documents) indexed into the
cluster over a 1 hour window containing the .Net equivalent of a dictionary
in json format. When a dictionary is serialized to json, it ends up with a
json object containing a list of properties and values. The current
behavior of Elasticsearch is to generate a mapping definition for each
field name in a json object. So when you serialize a dictionary, it means
every 'key' in the dictionary gets its own mapping definition. It turns out
this can lead to nasty consequences when indexed in Elasticsearch...
Essentially, every document contained its own list of unique keys which
resulted in Elasticsearch generating mapping definitions for all the keys.
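A minimal Python sketch of the failure mode (the field name "lookup", the key format, and the counts are illustrative, not taken from our actual data):

```python
# Sketch of the failure mode: each document carries a serialized
# dictionary whose keys are unique to that document, so the set of field
# names a dynamic mapping must track grows with every document indexed.
def distinct_mapped_fields(documents):
    """Count the distinct field names dynamic mapping would accumulate."""
    fields = set()
    for doc in documents:
        fields.update(doc["lookup"].keys())  # "lookup" is an assumed field name
    return len(fields)

# 30,000 documents, each with (say) 5 keys no other document uses.
docs = [
    {"lookup": {f"user-{i}-attr-{j}": "value" for j in range(5)}}
    for i in range(30_000)
]

print(distinct_mapped_fields(docs))  # 150,000 mapped fields from one doc type
```

Every one of those field names ends up in the mapping, and every new document that introduces new keys triggers another mapping (and therefore cluster state) update.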
We found this out by noticing that the type with the dictionary
continuously had its mappings updated (based on the master node log
files). The continual updating of the mappings (which are part of the
overall cluster state) caused the master nodes to lock up on the updates,
effectively stopping all other cluster operations. Upon further
investigation, the state file was over 70 MB by the time we ended up
stopping the cluster. Stopping the cluster was the only way to stop
updates to the mappings. We suspect the large mapping file was one of the
major reasons for nodes dropping; connections would time out during the
large file copy (I'm assuming the state is passed around the nodes in the
cluster).
As previously mentioned we had to stop the cluster. We then had to make
sure that all indexing operations were stopped. Upon restarting the cluster
we deleted all documents of the poisonous document type (which took a
while). This resulted in a much smaller state file and a stable cluster.
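One mitigation we could apply on our side (a common restructuring pattern, not something Elasticsearch does for you) is to serialize dictionaries as an array of objects with fixed field names, so the mapping stays constant no matter what keys users put in. A sketch in Python:

```python
# Sketch of a mitigation: instead of letting each dictionary key become a
# JSON property (and therefore a mapped field), emit the dictionary as an
# array of objects with the fixed field names "key" and "value".
def as_key_value_array(dictionary):
    return [{"key": k, "value": v} for k, v in dictionary.items()]

doc = as_key_value_array({"user-42-attr-0": "a", "user-42-attr-1": "b"})
# Only two field names ("key" and "value") ever reach the mapping,
# regardless of how many entries the dictionary has.
```

The trade-off is that you lose the ability to query a dictionary key as its own field, but the mapping (and cluster state) stops growing with the data.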
So this is my real question for the community: what is the correct way to
prevent this in the future (or does one already exist)? We could
obviously start reviewing what goes into our cluster more closely, but
should there be a feature in Elasticsearch to prevent this (assuming one
doesn't already exist)? I'm assuming there are a number of users with
clusters where they don't review everything that gets indexed. So would
it make sense for Elasticsearch to provide some feature to prevent this
issue, which is effectively a DOS attack on the cluster?
Thanks for reading this and I look forward to your responses!
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/23f0cc94-1cc7-4c8c-995c-c266dfbd40de%40googlegroups.com.