Dealing with dots in fields for the 2.0 upgrade

When running the migration plugin, pretty much every index came back as failed, because we had a template that automatically added an extra FIELD_NAME.raw field that was not analyzed (see: https://github.com/elastic/logstash/blob/v1.3.1/lib/logstash/outputs/elasticsearch/elasticsearch-template.json). I corrected this template for future indexes, but each of my existing indexes has offending fields. The only information I can really find is "reindex them", which seems like it is forcing me to do a lot of work to conform to the update. Regardless, I cannot find any information on how one would reindex while renaming a field, short of pulling down every document, swapping out the field, and inserting it into a new index. I have almost 2k indexes and a few TBs of data, and while I could write a program to do this, there has to be a better way, right?

Man I wish I could tell you this were easier.

First, a moment of background. From a Logstash perspective, Aaron wrote a stickied post over here about it. They added a 'de_dot' filter to help with situations where you may not have control over the source fields. Dots in field names also caused problems in Kibana, and they could cause really weird problems with field ambiguity, as seen here. Ugh... what a mess ever allowing them caused.
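
In case it's useful, here's roughly what that filter looks like in a pipeline (I'm going from memory on the option names, so check them against the de_dot docs for your Logstash version):

```
filter {
  de_dot {
    # Replace the dot in field names, e.g. "foo.bar" becomes "foo_bar"
    separator => "_"
    # Optionally limit the rename to specific fields instead of checking everything:
    # fields => ["foo.bar"]
  }
}
```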

I know none of that helps your current situation, because the unfortunate answer is you're going to have to reindex. This is one of those breaking changes that just had to happen. We looked into ways to simplify the process to avoid a reindex or do it in place, but the options just weren't tenable unfortunately.

A few of the Elasticsearch clients do have "reindex" helpers built in. I've written this script, which uses the Perl client's reindex API; you can use it as an example.

I appreciate the script, but it does not cut it. It reindexes 1000 docs at a time; I have 5 billion documents.

This is going to sound harsh, and I know this is going to be unpopular, but you guys dropped the ball with this. I am in this situation not because I didn't think out my naming conventions ahead of time, but because the Logstash team introduced these fields as part of their template. I am in this situation because Elastic:

A. released an update that added these fields and
B. released another update that breaks on any data touched by A.

So why isn't Elastic offering a tool to migrate this data? Because from where I am standing, it seems that I am being told, "Hey Brian, we caused you to be in a situation that prevents you from upgrading. Good luck fixing it on your own!" This is extremely unprofessional of you.

I have been spending cycles trying to reindex my data (again, 5 billion documents), and I guess I am dumb because there seems to be ZERO easy way to do it. At this point, why bother even using Elasticsearch if we are barred from ever upgrading in our current state?

Obviously I am upset, so again, sorry if this seems harsh. I just feel like I am spending my time doing something that Elastic should have been doing.

I hear you on the pain, but any solution here would require a reindex. The best we can do is help you with tools that assist with the reindex. The script is one example, and if you're already familiar with Logstash, the de_dot filter that was added there is an easy way to handle the renaming as part of the process. See https://gist.github.com/markwalkom/8a7201e3f6ea4354ae06 for details of how to configure Logstash as a reindex tool.
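
As a rough sketch of the kind of pipeline that gist describes (the host names and index names below are placeholders, and option names shifted a bit between Logstash versions, so treat the gist and your version's docs as the authority):

```
input {
  elasticsearch {
    hosts   => ["localhost:9200"]       # assumption: source cluster
    index   => "logstash-2015.01.01"    # assumption: one offending index
    size    => 1000                     # scroll batch size
    scroll  => "5m"
    docinfo => true                     # keep _index/_type/_id in @metadata
  }
}

filter {
  de_dot { }                            # rename dotted fields, e.g. "foo.bar" -> "foo_bar"
}

output {
  elasticsearch {
    hosts         => ["localhost:9200"]              # assumption: destination cluster
    index         => "logstash-2015.01.01-reindexed"
    document_type => "%{[@metadata][_type]}"
    document_id   => "%{[@metadata][_id]}"
  }
}
```

Run one of those per offending index (or script the generation of the config files), then delete or alias away the old index once the new one is populated.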

Whichever tool you decide to use, it sounds to me (correct me if I'm mishearing you) like your frustration is tied to the size of your index. That is, if you had a few million docs, either of the above two would chew through a thousand results at a time and be done pretty quickly, hardware permitting. Even on 5 billion documents, the script or the Logstash approach above may go quite quickly: 1000 docs per batch may be fine, for example, if you get a throughput of 100 batches per second (that's 5 million batches, or on the order of 14 hours). But I'll leave the profiling aside, because in any case there are some things that may help speed it up:

  • Update your translog fsync settings on the receiving cluster during the bulk reindex. ES 2.0 fsyncs by default on every request, but you can disable that during the load (there's a sketch of this after the list)
  • Use a different batch size: 1000 may not be optimal. This is not one-size-fits-all, as it varies by your hardware
  • Run multiple instances of the reindexer of your choosing, with the de_dot filter, in parallel, giving each a unique, non-overlapping query. This can be done in Logstash by using the query parameter (also sketched after the list). The same can be done with the script by adding a query body into the scroll. I've updated the script to show how to do this (line 49). Once you've split your data up a bit by query, you can then run several of them at the same time. Whether this helps or not will depend, again, on your hardware
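
To make the first and third bullets concrete, here's the sort of thing I mean, as a sketch. The translog setting names are the ES 2.0 ones as I remember them, and the index name, field, and date values are assumptions about your data, so verify against the docs for your exact versions:

```
# Relax translog durability on the destination index while bulk loading,
# so it isn't fsyncing on every bulk request...
curl -XPUT 'localhost:9200/logstash-2015.01.01-reindexed/_settings' -d '{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}'

# ...and put it back to the safe default once the reindex is done.
curl -XPUT 'localhost:9200/logstash-2015.01.01-reindexed/_settings' -d '{
  "index.translog.durability": "request"
}'
```

For the parallel reindexers, each Logstash instance gets its own non-overlapping slice via the query option on the elasticsearch input, e.g. a distinct @timestamp range per instance:

```
input {
  elasticsearch {
    # ... same connection options as in the reindex pipeline above ...
    query => '{ "query": { "range": { "@timestamp": { "gte": "2015-01-01", "lt": "2015-01-16" } } } }'
  }
}
```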

To add insult to injury, Elasticsearch works perfectly fine with dots in field names: https://github.com/elastic/elasticsearch/issues/14957

Dots were removed because they could cause ambiguity in the mapping. Better to remove a feature since someone might shoot themselves in the foot. You could run a patched version of Elasticsearch with the check in the mapper service removed, and it will work perfectly fine.

So, after weeks spent reindexing to fix the issue introduced by the Logstash team, the migration tool finally clears me for upgrade. So I schedule a production window, take the cluster down, and... no, it turns out the migration tool is a giant lie. Closed indexes, which are not checked by the migration tool, are checked by Elasticsearch 2.0.0 when it starts up! Therefore, without opening every closed index, checking it, and closing it again, there is ZERO way to upgrade.

From where I am standing, and I know this is unpopular, the 2.0.0 launch is straight-up embarrassing. There has been no regard for the end user at all, and it is laughable that, after working for weeks to prep for the upgrade, I am still barred from upgrading! I am currently in a state where there is no resolution, and it is looking likely that there will be none. Further, it is frustrating that all of this 'incorrect' data was directly caused by Elastic code! I honestly can't think of a larger deployment mess.

My advice for your next upgrade: remember that you do have existing users. For large-scale changes, use deprecation warnings, provide tools that actually migrate data, and make sure your verification tools actually work. Don't leave your users out in the cold to fend for themselves, like you have in this situation.