Dec 3rd, 2020: [EN] Cross version Elasticsearch data migration with ESM

Why ESM

Hey there,

I heard that you are using Elasticsearch. That's great: for search, as you know, it is the best choice, and it is evolving fast. There are so many nice new features already landed or on the way that I guess you can't wait to upgrade to the latest version, right?

Good! So how do we upgrade or migrate existing data to a new cluster? Let's explore the options we have.

  • Snapshot & Restore: first rule: always back up your data before upgrading. Snapshot is the first choice for migrations. You can start a new cluster, configure it to use the same snapshot repository, restore the data on the new cluster, run some tests, and then decide whether to finish the remaining migration steps. Snapshots are good, but jumping major versions may take extra effort.
  • Rolling upgrade: another preferred method, since it upgrades your cluster without stopping your business. Minor version upgrades are easy to perform; for a major version jump you first need to upgrade to the latest minor of your current major, and you can't roll back to the old version. It also may not work for older versions like 5.x, 2.x, or earlier.
  • Reindex: built into Elasticsearch, easy and great. The only thing you need to do is configure reindex.remote.whitelist before use. It's a static setting, though, and requires a restart.
  • Export and Import: plain JSON data traveling between clusters. You can export Elasticsearch data to local files and import it later, a good fit for small data volumes, and there are a bunch of tools provided by our awesome community.
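To make the Reindex option above concrete, here is a minimal sketch of a remote reindex, assuming a hypothetical source host oldcluster.example.com and placeholder index names:

```shell
# In elasticsearch.yml on the destination cluster (static setting, restart needed):
#   reindex.remote.whitelist: "oldcluster.example.com:9200"

# Then pull documents from the remote cluster with the _reindex API:
curl -X POST "http://localhost:9200/_reindex?pretty" \
  -H 'Content-Type: application/json' -d '
{
  "source": {
    "remote": { "host": "http://oldcluster.example.com:9200" },
    "index": "source_index"
  },
  "dest": { "index": "dest_index" }
}'
```

This requires a live destination cluster, so treat it as a template rather than a ready-to-run command.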

So, there are already a bunch of choices. What's my suggestion?

If you are already familiar with Reindex and Snapshot, use them as the first choice. But if your data size is less than 500GB~1TB and you need a quick way to get your migration done, try ESM:

  • It's fast: as a reference, one customer used ESM to migrate a billion documents in hours rather than days compared to other solutions.
  • ESM is written in Golang and ships as prebuilt binaries, so you don't need to worry about Node.js, Python, Java, or any other dependency problems.
  • ESM supports cross-version Elasticsearch data migration: the differences between Elasticsearch versions are handled internally so that scroll and bulk operations just work.

How to use it

Installation is quite easy. All you need to do is download the latest release of ESM from the GitHub releases page:

Install

Download the package that suits your platform and extract the executable binary. Then you are good to go.

# For macOS
➜  wget https://github.com/medcl/esm/releases/download/v0.4.5/darwin64.tar.gz

# For Windows 
# https://github.com/medcl/esm/releases/download/v0.4.5/windows64.tar.gz
# For Linux
# https://github.com/medcl/esm/releases/download/v0.4.5/linux64.tar.gz

# Extract files
➜  tar vxzf darwin64.tar.gz

Run ESM

Let's take a quick look at what it looks like when you run a migration with ESM.

➜  ./esm -s https://elastic:pass@eshost:9343 -d https://elastic:pass@eshost:8000 -x medcl2 -y medcl23 -r -w 200 --sliced_scroll_size=40 -b 10 -t=30m
[11-12 22:09:38] [INF] [main.go:461,main] start data migration..
Scroll 20377840 / 20387840 [=====================================================================================================================]  99.95% 1m20s
Bulk 20371785 / 20387840 [=======================================================================================================================]  99.92% 1m53s
[11-12 22:11:32] [INF] [main.go:492,main] data migration finished.

20 million documents migrated within two minutes. Not bad at all. :muscle:

More options

There are some nice features provided as parameters; choose them based on your needs.

➜  darwin64 ./esm --help
Usage:
  esm [OPTIONS]

Application Options:
  -s, --source=                    source elasticsearch instance, ie: http://localhost:9200
  -d, --dest=                      destination elasticsearch instance, ie: http://localhost:9201
  -m, --source_auth=               basic auth of source elasticsearch instance, ie: user:pass
  -n, --dest_auth=                 basic auth of target elasticsearch instance, ie: user:pass
      --source_proxy=              set proxy to source http connections, ie: http://127.0.0.1:8080
      --dest_proxy=                set proxy to target http connections, ie: http://127.0.0.1:8080
  -x, --src_indexes=               indexes name to copy, support regex and comma separated list (_all)
  -y, --dest_index=                indexes name to save, allow only one indexname, original indexname will be used if not specified
  -q, --query=                     query against source elasticsearch instance, filter data before migrate, ie: name:medcl
  -c, --count=                     number of documents at a time: ie "size" in the scroll request (10000)
  -w, --workers=                   concurrency number for bulk workers (1)
  -b, --bulk_size=                 bulk size in MB (5)
  -t, --time=                      scroll time (1m)
      --sliced_scroll_size=        size of sliced scroll, to make it work, the size should be > 1 (1)
  -f, --force                      delete destination index before copying
  -a, --all                        copy indexes starting with . and _
      --copy_settings              copy index settings from source
      --copy_mappings              copy index mappings from source
      --shards=                    set a number of shards on newly created indexes
  -u, --type_override=             override type name
  -v, --log=                       setting log level,options:trace,debug,info,warn,error (INFO)
  -i, --input_file=                indexing from local dump file
  -o, --output_file=               output documents of source index into local file
      --input_file_type=           the data type of input file, options: dump, json_line, json_array, log_line (dump)
      --refresh                    refresh after migration finished
      --green                      wait for both hosts cluster status to be green before dump. otherwise yellow is okay
      --fields=                    filter source fields, comma separated, ie: col1,col2,col3,...
      --rename=                    rename source fields, comma separated, ie: _type:type, name:myname
  -l, --logstash_endpoint=         target logstash tcp endpoint, ie: 127.0.0.1:5055
      --secured_logstash_endpoint  target logstash tcp endpoint was secured by TLS
      --repeat_times=              repeat the data from source N times to dest output, use align with parameter regenerate_id to amplify the data size
  -r, --regenerate_id              regenerate id for documents, this will override the exist document id in data source

Fix type

One common issue you need to fix when upgrading Elasticsearch from a very old version, say 5.x to 7.x, is document types. You may need to clean up and unify them: since 6.x there is only one type per index, and for a short period the default type was doc before it finally changed to _doc. Using ESM, you can migrate 5.x documents to 7.x like this:

./esm -s http://localhost:9201 -x "source_index" -y "target_index"  -d https://localhost:9200 --rename="_type:type,age:myage"  -u"_doc"

The -u parameter overrides the document type with a new type, _doc, which has no practical meaning but aligns the documents with the new version (types are going to be removed from Elasticsearch entirely). And yes, you may still need your documents' original type names. No worries: use the rename parameter to keep the original _type in a new field, type, which you can use to filter your documents later.

Speedup migration

To speed up a migration with ESM, there are two parts to consider: scroll and bulk.

ESM uses the scroll API and bulk API under the hood. To speed up the scroll side, consider setting a proper sliced_scroll_size and scroll count. To speed up the bulk side, consider more workers and a proper bulk_size. There are no universally good defaults; it always depends, but it is easy to find out exactly what you need to tune.

There will be two progress bars while the migration runs with ESM.

  • If the bulk bar can't catch up with the scroll bar, the bottleneck is on the bulk side. Consider increasing the workers.
  • If the bulk bar catches up with the scroll bar easily, consider increasing sliced_scroll_size and count.
  • If the host running ESM or Elasticsearch is pegged at 100% CPU, there is nothing more you can do. Let it run and wait for it to finish. :stuck_out_tongue_closed_eyes:
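Putting those knobs together, a hedged starting point might look like the following. The endpoints, index name, and numbers here are illustrative, not recommendations; measure against your own cluster:

```shell
# Scroll side: slice the scroll (--sliced_scroll_size) and fetch bigger
# batches per request (-c); bulk side: more workers (-w) and a larger
# bulk request size in MB (-b). -t keeps the scroll context alive long
# enough for slow consumers.
./esm -s http://source:9200 -d http://dest:9200 -x my_index \
  --sliced_scroll_size=10 -c 5000 -w 8 -b 20 -t=60m
```

This needs live source and destination clusters, so adjust the endpoints before running it.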

Offline migration

ESM supports dumping documents to local files, so you can load them into the new cluster later.

First, dump to a local file. Use -o to specify the export file location. You can also use a query string to filter out a subset of the documents, which is super useful for incremental migration:

./esm -s https://192.168.3.98:9200 -m elastic:password -o json.out -x kibana_sample_data_ecommerce -q "order_date:[2020-02-01T21:59:02+00:00 TO 2020-03-01T21:59:02+00:00]"
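If you run such an export on a schedule, the date range can be generated instead of hardwired. A small sketch, assuming GNU date and a 24-hour window:

```shell
# Build a Lucene-style range query covering the last 24 hours,
# suitable for passing to esm via -q.
SINCE=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S+00:00)
NOW=$(date -u +%Y-%m-%dT%H:%M:%S+00:00)
echo "order_date:[${SINCE} TO ${NOW}]"
```

On macOS, the BSD equivalent of the first line would use date -u -v-24H instead of -d.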

Then, use ESM to load the file and send data to the new cluster:

./bin/esm -i json.out -d  http://localhost:9201 -y target-index1

Go wild with ESM

If you just want to do a quick benchmark test with your own dataset but don't have enough test data, you can use ESM to amplify the existing dataset by regenerating the document _id:

./bin/esm -i input.json -d  http://localhost:9201 -y target-index1  --regenerate_id  --repeat_times=1000 

Note that for benchmark testing, Rally is strongly recommended since it can reproduce load scenarios in a consistent way.

Common issues

If you run into OOM, consider increasing the memory of the host running ESM, because ESM uses memory as a buffer layer to speed up the migration. Alternatively, lower the speed by setting fewer workers or a smaller scroll batch size.
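For example, a deliberately gentle run on a memory-constrained host might look like this (endpoints, index name, and values are illustrative):

```shell
# One worker, small scroll batches, small bulk requests: slower, but
# it keeps ESM's in-memory buffer small.
./esm -s http://source:9200 -d http://dest:9200 -x my_index -w 1 -c 500 -b 1
```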

If you got this exception:

[08-21 09:08:31] [ERR] [scroll.go:49,Next] {"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."}],"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."},"status":400}

That usually means too many shards are involved in your scroll requests, which makes the scroll id (and thus the HTTP request line) too long. You can fix it by modifying the elasticsearch.yml file as below:

http.max_header_size: 16k
http.max_initial_line_length: 8k

An Elasticsearch restart is required in that case. :pensive:

Conclusion

ESM is a simple tool for Elasticsearch data migration. I hope you will like it.
Cheers and have fun with your data!
