Question About Repository-hdfs

Hi All,

We have been using HDFS as our Elasticsearch snapshot repository to back up and restore our indices for several months. At first, everything seemed to go well. However, with the growth of the data size and the number of snapshots, more and more small files are generated. The number of small files is now over 7 million. **As a result, this huge number of small files affects the performance of HDFS.** Therefore, I hope to receive a solution to our puzzling problem. Our questions are as follows.

  1. With regard to the growing number of small files in the HDFS repository, what is the recommended solution to this problem?

  2. So far, we have tried two methods.

2.1) We have already tried to create a compressed repository, as in the demo below. However, it does not reduce the number of files, and only the mappings and settings files are compressed, not the data files. This method did not solve the problem.

```
$ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "hdfs",
  "settings": {
    "location": "/flightdev/backups/my_backup",
    "compress": true
  }
}'
```

2.2) Since the backup procedure is incremental, we have tried archiving the old existing backup files in 'har' (Hadoop Archive) format. However, once the old backup files are archived, the restore procedure no longer works and the backups stop being incremental, because repository-hdfs cannot read the archived files.
We run a backup snapshot every 2 hours. To make the backup and restore snapshots work correctly, we would have to extract the .har files before every backup or restore and re-archive everything afterwards (a rough sketch of this cycle is shown below). However, this costs a lot of time!
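To make the cycle concrete, this is roughly what it looks like with the standard `hadoop archive` and `hdfs dfs` commands; the archive name and the `archived` directory are just examples, and the repository path is the one from 2.1:

```
# Archive the old backup files into a HAR after the snapshots have finished.
$ hadoop archive -archiveName my_backup.har \
    -p /flightdev/backups my_backup /flightdev/backups/archived

# Before the next snapshot or a restore, copy the files back out of the
# archive so that repository-hdfs can read them again.
$ hdfs dfs -cp har:///flightdev/backups/archived/my_backup.har/my_backup \
    /flightdev/backups/
```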

Therefore, with regard to the archived files, is there any proposed solution to the puzzling problems described in 2.2?

Thank you.

Best regards.

Do you have a defined retention policy for your snapshots? How long do you keep them? Do you keep all snapshots?

I don't have a retention policy. I keep all the snapshots and never delete them. BTW, do you have a retention policy for snapshots?

Snapshots are not entirely incremental, so keeping all snapshots indefinitely is going to take up a lot of space, which is why I asked about the retention period, i.e. how long you need to keep them. Snapshots work at the Lucene segment level: if a segment is already included in a previous snapshot when you create a new one, it will not be copied again, which is why snapshots are sometimes referred to as incremental, even though this is not true at the document level.

As segments are immutable, Elasticsearch automatically merges smaller segments into larger ones. If this happens and a new snapshot is taken, the new, merged segment will also be copied. Old segments will, however, only be deleted from the repository once there are no more snapshots that reference them. Over time it is therefore likely that the same records will exist in multiple segments in the repository if old snapshots are never deleted.
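If you want to get a feel for how many segments an index currently has and how large they are, you can look at the cat segments API (the index name here is just an example):

```
$ curl -XGET 'http://localhost:9200/_cat/segments/my_index?v'
```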

I would therefore recommend you define a retention policy for your snapshots and stop keeping them around forever.
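Enforcing such a policy is basically a matter of deleting old snapshots; deleting a snapshot also removes any segment files that are no longer referenced by the remaining snapshots. A minimal example, with a made-up snapshot name:

```
$ curl -XDELETE 'http://localhost:9200/_snapshot/my_backup/snapshot_2016_01_01'
```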

Thank you for your answers. I really appreciate it!

Here are the screen captures of the Elasticsearch data.
Picture 1 shows the files on an Elasticsearch data node.
Picture 1:

Picture 2 shows the files in the HDFS repository.
Picture 2:

My Questions:

  1. In Picture 1, are the files such as _rp8*.* the Lucene segment files on the ES data node?
  2. In Picture 2, are the files such as __*rp the backed-up Lucene segment files in the repository-hdfs?
  3. If the answers to Q1 and Q2 are both 'yes', what is the relationship between the segment files in Q1 and Q2? (If this question is too complicated to answer, could you give me some guidelines for solving the problem?)

Thank you very much from the bottom of my heart.

Elasticsearch is based on Lucene; however, one cannot just take the Lucene segments and put them into ES, or vice versa. Backups do contain meta information about the segments, but the files in both Q1 and Q2 are really meant for internal ES usage. In fact, every major release ends up adding some bits here and there, which has an impact on compatibility that ES handles internally.

So, all in all, while one can always look inside ES, understand the format and use it, it is completely unsupported. If one wants access to the data, getting the raw JSON out of ES is the way forward.
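For example, the raw JSON documents can simply be pulled out through the search API (the index name is just a placeholder; for anything large, the scroll API is the usual way to page through all documents):

```
$ curl -XGET 'http://localhost:9200/my_index/_search?pretty&size=100'
```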

Thank you. About the retention policy for snapshots, does Elasticsearch itself support defining a retention policy? Or is there any plugin that supports a retention policy for snapshots?

One can use Curator to delete old snapshots. ES itself does not actively seek out old snapshots and delete them.
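For example, with the curator_cli singleton command one could delete all snapshots older than a given age; the repository name and the 30-day cutoff below are just placeholders, and the exact options depend on your Curator version, so treat this as a sketch:

```
$ curator_cli --host localhost delete_snapshots --repository my_backup \
    --filter_list '[{"filtertype": "age", "source": "creation_date", "direction": "older", "unit": "days", "unit_count": 30}]'
```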