[Ann] Easy importing/exporting indices with the Knapsack plugin

Hi,

I have written a very simple and easy-to-use importing/exporting tool for
Elasticsearch, the Knapsack plugin.

How does it work?

Knapsack depends on the _source field being enabled. Sending an _export
command as an HTTP POST (for example with curl) starts the export, which
runs a scan/scroll request/response loop over the index in a background
thread. The _source of each document is written as a tar entry to a tar
archive file in the local file system. By default, the tar archive is gzipped.
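
The scan/scroll loop mentioned above can be pictured roughly like this. It is only a sketch and not the plugin's code: `start_scan` and `next_page` are hypothetical callables standing in for the real HTTP requests, and the response shape (`_scroll_id`, `hits.hits`) is assumed to follow the usual Elasticsearch search response.

```python
def iterate_scan_scroll(start_scan, next_page):
    """Yield every hit of a scan/scroll style search.

    start_scan() -> scroll id of the opened scan (hypothetical callable)
    next_page(scroll_id) -> dict shaped like
        {"_scroll_id": "...", "hits": {"hits": [...]}} (hypothetical callable)
    """
    scroll_id = start_scan()
    while True:
        page = next_page(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            # An empty page means the scroll is exhausted.
            break
        for hit in hits:
            yield hit
        # Always continue with the scroll id of the latest response.
        scroll_id = page["_scroll_id"]


if __name__ == "__main__":
    # In-memory stand-in for a cluster: three pages, the last one empty.
    fake_pages = {
        "s0": {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
        "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
        "s2": {"_scroll_id": "s3", "hits": {"hits": []}},
    }
    for hit in iterate_scan_scroll(lambda: "s0", lambda sid: fake_pages[sid]):
        print(hit["_id"])
```

Passing the two requests in as callables keeps the loop itself independent of any particular HTTP client.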

The index settings and mappings are also written into the tar archive.

The tar entry name consists of index/type/id, and the name of the tar
archive consists of '_' '.tar.gz'
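
To make the archive layout concrete, here is a minimal sketch that writes documents into a gzipped tar with index/type/id entry names, using only Python's standard `tarfile` module. The function name, the sample documents and the file name `example.tar.gz` are made up for illustration; the plugin itself is not involved, and settings/mapping entries are left out because their entry names are not given here.

```python
import io
import json
import os
import tarfile
import tempfile


def write_docs_to_tar(docs, archive_path):
    """Write each document's _source as one entry of a gzipped tar archive.

    `docs` is an iterable of dicts with the keys _index, _type, _id and
    _source, i.e. the shape of a search hit.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for doc in docs:
            payload = json.dumps(doc["_source"]).encode("utf-8")
            # Entry name follows the index/type/id scheme described above.
            name = "{}/{}/{}".format(doc["_index"], doc["_type"], doc["_id"])
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    hits = [
        {"_index": "books", "_type": "novel", "_id": "1", "_source": {"title": "A"}},
        {"_index": "books", "_type": "novel", "_id": "2", "_source": {"title": "B"}},
    ]
    path = os.path.join(tempfile.mkdtemp(), "example.tar.gz")
    write_docs_to_tar(hits, path)
    with tarfile.open(path, "r:gz") as tar:
        print(tar.getnames())
```

Because every document is a plain JSON file inside a standard tar, the result can be inspected with `tar tzf` or any archive tool.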

Importing works exactly the other way round. Put a tar archive into the
file system and import it via _import with an HTTP POST (again, for example
with curl). A background thread is spawned and the contents of the tar
archive are bulk-inserted into Elasticsearch. By renaming the archive file,
you can even copy indices.
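
As a rough illustration of the import direction, the sketch below turns index/type/id tar entries into the line pairs expected by the Elasticsearch bulk API (an action line followed by the document source). It is not the plugin's implementation: the function name and the `target_index` parameter are assumptions, and overriding the index name here only mimics the effect of copying an index under a new name.

```python
import io
import json
import os
import tarfile
import tempfile


def bulk_lines_from_tar(archive_path, target_index=None):
    """Yield bulk API lines (action line, then source line) per tar entry."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            parts = member.name.split("/", 2)
            if not member.isfile() or len(parts) != 3:
                # Skip anything that is not an index/type/id document entry.
                continue
            index, doc_type, doc_id = parts
            source = json.loads(tar.extractfile(member).read().decode("utf-8"))
            action = {"index": {"_index": target_index or index,
                                "_type": doc_type,
                                "_id": doc_id}}
            yield json.dumps(action)
            # Re-serialize so the source is guaranteed to fit on one line.
            yield json.dumps(source)


if __name__ == "__main__":
    # Build a tiny demo archive, then print the bulk body for a copied index.
    path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
    with tarfile.open(path, "w:gz") as tar:
        data = json.dumps({"title": "A"}).encode("utf-8")
        info = tarfile.TarInfo(name="books/novel/1")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    print("\n".join(bulk_lines_from_tar(path, target_index="books_copy")))
```

The yielded lines, joined with newlines and terminated by a final newline, form a bulk request body that any HTTP client could send.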

Note that it is not a full-fledged backup/restore solution. It is just a
toy, with many features not (yet) present. It can help in situations where
you want to offload indices quickly and transfer them to another place,
e.g. for bug hunting or for testing.

So, if your indices want to take a hike, maybe across Elasticsearch
clusters, put the knapsack on and enjoy!

https://github.com/jprante/elasticsearch-knapsack

Cheers,

Jörg

--

Very nice. I will play with it!

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

Hi Jörg,

Very nice. How come you went for export+import and not a straight
scan/scroll=>index?

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

--

Well, a straight scan/scroll->index is already there, in the reindex plugin
:slight_smile:

My primary intention was being able to create tar archive files from an
index, for transporting the archive to another Elasticsearch instance.

Jörg

https://github.com/jprante/elasticsearch-knapsack

--

So I have been looking to do something like this, but was thinking I would just copy the local directory structures from one cluster to another. As long as the number of nodes matches, it should just work. Did you look into this?

Thanks

--

Of course I looked into the copy approach. But that approach has some
limitations.

Besides the same number of nodes, you need the same ES version and the
guarantee that the local directory structures have been flushed into a
consistent state. You also end up with binary Lucene indexes plus ES
control files, which are somewhat cryptic to open and process with
non-Lucene tools.

My challenge is offloading JSON data for postprocessing, and also moving it
to other ES installations for reloading, no matter how many nodes or what
version :slight_smile:
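
As an example of such postprocessing outside ES, the following sketch counts the documents per index and type in an archive using nothing but Python's standard library. It assumes the index/type/id entry naming and a gzipped archive; the function name and sample data are hypothetical.

```python
import collections
import io
import json
import os
import tarfile
import tempfile


def count_docs_per_type(archive_path):
    """Count document entries per (index, type), looking only at entry names."""
    counts = collections.Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            parts = member.name.split("/", 2)
            if member.isfile() and len(parts) == 3:
                counts[(parts[0], parts[1])] += 1
    return counts


if __name__ == "__main__":
    # Build a small demo archive with two types in one index.
    path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
    with tarfile.open(path, "w:gz") as tar:
        for name in ["books/novel/1", "books/novel/2", "books/poem/1"]:
            data = json.dumps({"name": name}).encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    for (index, doc_type), n in sorted(count_docs_per_type(path).items()):
        print(index, doc_type, n)
```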

Imagine the following use cases: creating index dumps for remote analysis,
preparing a set of index data for reproducible benchmarks, or packaging
aggregated data, harvested from many places by many methods over time, for
further processing outside ES.

Jörg

--

Jörg,

I think this is a great plugin! I can think of interesting use cases. Thanks
for sharing.

Lukáš

--