Dec 15th, 2017: [EN][Elasticsearch] Going from multiple types to one type with the Reindex API


(Abdon Pijpelink) #1

Every document in Elasticsearch has a type. It has long been the recommendation of Elastic to use only one document type per index, but with the release of version 6.0 this has become more than just advice. For new indices, Elasticsearch now only accepts one document type per index, as a first step to the complete removal of document types in future versions of Elasticsearch.

The fact that Elasticsearch now only accepts a single type per index makes this a good time to think about how you would migrate indices with multiple types to indices with a single type. In this post we want to talk about the Elasticsearch Reindex API as a tool to help with that migration. The Reindex API helps you get documents from one or more indices into a new index.

A call to Reindex can be combined with a script. This script will then allow you to transform the documents as they get reindexed, for example by adding or removing fields. Scripts also allow you to change a document’s metadata fields, like _id, _type and _index. You can use that to change a document’s type, for example.

Let’s say you have two document types in a single index called old:

  • a company type
  • an employee type

You could use the Reindex API to copy all documents to a new index, and combine that with a script that changes the type of every document to a single type. The convention we use at Elastic is to use a type called doc:

POST _reindex
{
  "source": {
    "index": "old"
  },
  "dest": {
    "index": "new"
  },
  "script": {
    "source": """
        ctx._source.type = ctx._type;
        ctx._type = 'doc';
    """,
    "lang": "painless"
  }
}

Great, all documents now have a single type! The idea here is that you would have created the target index new before you ran the _reindex command, defining the mappings of the doc type (being the merged mappings of the original types).
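
As a rough illustration of what "merged mappings" means here, the following Python sketch combines the properties of each original type into one doc mapping and refuses to merge fields that are defined differently. The helper merge_type_mappings and the field names (name, founded, company_id) are hypothetical, and mapping the new type field as a keyword is an assumption, chosen so that the field can be matched exactly.

```python
import json


def merge_type_mappings(type_mappings):
    """Merge per-type `properties` into one mapping for the single type `doc`."""
    # Assumption: the `type` field written by the reindex script is a keyword.
    merged = {"type": {"type": "keyword"}}
    for type_name, mapping in type_mappings.items():
        for field, definition in mapping["properties"].items():
            if field in merged and merged[field] != definition:
                raise ValueError(
                    f"field '{field}' of type '{type_name}' conflicts with an earlier definition"
                )
            merged[field] = definition
    return {"mappings": {"doc": {"properties": merged}}}


# Hypothetical mappings of the two types in the index `old`.
old_mappings = {
    "company": {"properties": {"name": {"type": "text"}, "founded": {"type": "date"}}},
    "employee": {"properties": {"name": {"type": "text"}, "company_id": {"type": "keyword"}}},
}

new_index_body = merge_type_mappings(old_mappings)
print(json.dumps(new_index_body, indent=2))
```

The resulting dictionary has the shape of a request body for PUT new, to be sent before the _reindex call. A conflict error is a hint that one of the fields needs to be renamed in the reindex script.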

The script also adds the original value of _type as the value of a new field type. You can use that field to filter on specific document types in your queries.
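
A minimal sketch of such a filter, written as a Python function that builds the search body. It assumes the type field is mapped as a keyword in the new index, so that a term query matches the original type name exactly; the helper name and the name field in the example are hypothetical.

```python
import json


def filter_by_original_type(original_type, query=None):
    """Build a search body restricted to documents that came from `original_type`."""
    return {
        "query": {
            "bool": {
                # The scoring part of the query; defaults to matching everything.
                "must": query if query is not None else {"match_all": {}},
                # Non-scoring filter on the field added by the reindex script.
                "filter": {"term": {"type": original_type}},
            }
        }
    }


body = filter_by_original_type("employee", {"match": {"name": "smith"}})
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of GET new/_search.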

However, the above only works when every document in the old index has an _id that is unique across types, i.e. when documents of different types never share the same _id. If they do, those documents collide in the new index and the one that is reindexed last overwrites the other. In that case, you can change the _id of the documents in the script, for example by prepending the original _type to the _id. An employee with an _id of 1 would get employee_1 as its new _id. To do so, you would change the script above into:

// Prefix the _id with the original type while ctx._type still holds it
ctx._id = ctx._type + "_" + ctx._id;
ctx._source.type = ctx._type;
ctx._type = 'doc';
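
To make the collision concrete, here is a small plain-Python model of what the script does to each document. It is only a sketch: the dictionaries stand in for the ctx object, the sample documents are made up, and the target index is modelled as a dictionary keyed on (_type, _id).

```python
def merge_types(doc, prefix_id=False):
    """Return a rewritten copy of a ctx-like dict, mirroring the Painless script."""
    ctx = {**doc, "_source": dict(doc["_source"])}
    if prefix_id:
        # Must happen before _type is overwritten below.
        ctx["_id"] = ctx["_type"] + "_" + ctx["_id"]
    ctx["_source"]["type"] = ctx["_type"]
    ctx["_type"] = "doc"
    return ctx


def reindex(docs, prefix_id=False):
    """Model the target index: a later document with the same key wins."""
    new_index = {}
    for doc in docs:
        out = merge_types(doc, prefix_id=prefix_id)
        new_index[(out["_type"], out["_id"])] = out["_source"]
    return new_index


# Hypothetical documents: one company and one employee sharing _id 1.
old_index = [
    {"_id": "1", "_type": "company", "_source": {"name": "Example Corp"}},
    {"_id": "1", "_type": "employee", "_source": {"name": "Jane Doe"}},
]

without_prefix = reindex(old_index)
with_prefix = reindex(old_index, prefix_id=True)

print(len(without_prefix))  # one document survives
print(sorted(doc_id for _, doc_id in with_prefix))  # both documents survive
```

Without the prefix, the employee replaces the company because both end up as doc/1. With the prefix, they are stored as company_1 and employee_1.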

Alternatively, you could create a separate index for each of the document types. You could name your new indices based on the original _type of the documents that they will contain. In our example we would end up with two indices: new_company and new_employee. The script we could use to accomplish that:

ctx._index = "new_" + ctx._type;
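
The effect of this one-line script can be modelled in plain Python as follows. This is a sketch with made-up documents, in which a dictionary stands in for ctx. Note that _id and _type are left untouched: each new index contains only one type, so the ids cannot collide.

```python
from collections import defaultdict


def route_to_type_index(ctx):
    """Mirror the script: pick the target index from the original type."""
    out = dict(ctx)
    out["_index"] = "new_" + out["_type"]
    return out


# Hypothetical contents of the index `old`.
old_docs = [
    {"_index": "new", "_id": "1", "_type": "company"},
    {"_index": "new", "_id": "1", "_type": "employee"},
    {"_index": "new", "_id": "2", "_type": "employee"},
]

indices = defaultdict(list)
for doc in old_docs:
    routed = route_to_type_index(doc)
    indices[routed["_index"]].append(routed["_id"])

print(dict(indices))
```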

So far so good, but what if you were using a parent/child relationship between your employee and company documents? The way parent/child relationships have to be set up has changed as of version 6.0. You now have to use a field of type join (introduced in version 5.6), which defines the relationship between documents. Parent and child documents still need to live in the same index, and they now also have to be of the same type.

The first thing you would need to do is to set up a new index with the join field type in your mappings. If you do this on Elasticsearch 5.6, you will need to explicitly configure your index to be a single-type index using the setting "index.mapping.single_type": true:

PUT new
{
  "settings": {
    "index.mapping.single_type": true
  },
  "mappings": {
    "doc": {
      "properties": {
        "join": {
          "type": "join",
          "relations": {
            "company": "employee"
          }
        }
      }
    }
  }
}

Our script now becomes a bit more complex. We add a field join.name to our documents that gets the original _type as its value. For the child documents, we also provide the _id of the parent document as the values of join.parent and _routing:

String parentType = "company";

ctx._source.join = new HashMap();
ctx._source.join.name = ctx._type;

ctx._id = ctx._type + "_" + ctx._id;

// Only child documents have a value for _parent
if (ctx._parent != null)
{
  String routing = parentType + "_" + ctx._parent;
      
  ctx._source.join.parent = routing;
  ctx._routing = routing;
        
  ctx._parent = null;
}

ctx._type = 'doc';  
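
To see what this script produces for a parent and for a child, here is a plain-Python model of the same steps. It is a sketch only: dictionaries stand in for ctx, the sample documents are made up, and the function name is hypothetical.

```python
def to_join_doc(doc, parent_type="company"):
    """Mirror the Painless script on a ctx-like dict and return the rewritten copy."""
    ctx = {**doc, "_source": dict(doc["_source"])}

    ctx["_source"]["join"] = {"name": ctx["_type"]}
    ctx["_id"] = ctx["_type"] + "_" + ctx["_id"]

    # Only child documents have a value for _parent.
    if ctx.get("_parent") is not None:
        # The parent was reindexed with a prefixed _id, so prefix it here too.
        routing = parent_type + "_" + ctx["_parent"]
        ctx["_source"]["join"]["parent"] = routing
        ctx["_routing"] = routing
        ctx["_parent"] = None

    ctx["_type"] = "doc"
    return ctx


# Hypothetical documents: company 1 and its employee 7.
company = {"_id": "1", "_type": "company", "_parent": None,
           "_source": {"name": "Example Corp"}}
employee = {"_id": "7", "_type": "employee", "_parent": "1",
            "_source": {"name": "Jane Doe"}}

new_company = to_join_doc(company)
new_employee = to_join_doc(employee)

print(new_company["_id"], new_company["_source"]["join"])
print(new_employee["_id"], new_employee["_routing"], new_employee["_source"]["join"])
```

The child ends up routed to company_1, which is also the new _id of its parent, so both documents land on the same shard as the join field requires.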

… et voilà: all of our parent/child documents now happily live in the same index as a single type, conforming with the new version 6.x structure.

Hopefully this post has given you an idea of how powerful the combination of the Reindex API and scripting is when preparing your indices for the migration to version 6.x. The recipes above work on version 5.6, so you can reindex your data into new indices before migrating those indices to 6.x. Alternatively, you could use the scripts in combination with the Reindex from remote functionality to pull data from an existing 5.x cluster into a new 6.x cluster.
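
For the Reindex from remote variant, the request body gains a remote section inside source. The Python sketch below only builds that body; the host name is hypothetical, and the script is the first one from this post. It assumes that the new cluster lists the old cluster's host under reindex.remote.whitelist in elasticsearch.yml, which Reindex from remote requires.

```python
import json

# The type-merging script from the first example in this post.
MERGE_TYPES_SCRIPT = "ctx._source.type = ctx._type; ctx._type = 'doc';"


def remote_reindex_body(remote_host, source_index, dest_index, script_source):
    """Build a _reindex body that pulls `source_index` from another cluster."""
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_index,
        },
        "dest": {"index": dest_index},
        "script": {"source": script_source, "lang": "painless"},
    }


# Hypothetical address of the existing 5.x cluster.
body = remote_reindex_body(
    "http://old-cluster.example:9200", "old", "new", MERGE_TYPES_SCRIPT
)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of POST _reindex on the new cluster.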


(eliasah) #2

This is excellent. If I may ask, why isn't it part of the official documentation?


(Mark Walkom) #3