Elasticsearch as a database

Hi

I am considering using ES as a key-value store (as a primary data store,
not a cache). Is it suitable for this purpose?
Initially it looks really interesting - I like the easy way of adding nodes,
Thrift, the realtime KV API and other features. However, there are a few
topics where the documentation is not clear to me, so I have a few questions:

  1. What happens when a new document is indexed? Can I be sure that it will
    be stored on disk even if a node goes down a millisecond later?
    The documentation mentions that an index operation only succeeds if "a
    quorum (>replicas/2+1) of active shards are available" - does that mean
    it actually stores the data on those nodes, or that they only need to be
    "not known to be down at the moment of indexing"?

  2. How do I perform a backup and restore of a whole index?

  3. How reliable would ES be when used for this purpose?

Thanks for any help

--

Hello,

On Wed, Dec 5, 2012 at 2:03 PM, Maciej Dziardziel fiedzia@gmail.com wrote:

Hi

I am considering using ES as a key-value store (as a primary data store,
not a cache). Is it suitable for this purpose?

I think it is, for most use cases. Of course, it depends on what features
you need and whether ES can provide them.

Initially it looks really interesting - I like the easy way of adding nodes,
Thrift, the realtime KV API and other features. However, there are a few
topics where the documentation is not clear to me, so I have a few questions:

  1. What happens when a new document is indexed? Can I be sure that it will
    be stored on disk even if a node goes down a millisecond later?

Yes. You'll get an OK for your indexing operation once it has been stored
in your transaction log, which is periodically flushed to your index
(according to your settings).

So even if the doc doesn't have a chance to get into your index, it can get
there after restarting the node.
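To make the translog behavior above concrete, here is a sketch of the flush-related index settings. The setting names follow the 0.x-era index modules documentation; the values are purely illustrative:

```python
import json

# Illustrative translog settings: a flush moves the transaction log
# contents into the Lucene index when any of these thresholds is hit.
settings = {
    "index": {
        "translog.flush_threshold_ops": 5000,      # flush after N operations
        "translog.flush_threshold_size": "200mb",  # ...or this much translog
        "translog.flush_threshold_period": "30m",  # ...or this much time
    }
}
print(json.dumps(settings, indent=2))
```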

The documentation mentions that an index operation only succeeds if "a quorum
(>replicas/2+1) of active shards are available" - does that mean it
actually stores the data on those nodes,

Yes.

or that they only need to be "not known to be down at the moment of index"?
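For what it's worth, the quorum formula quoted from the documentation works out like this (a sketch; `write_quorum` is just an illustrative name):

```python
def write_quorum(total_copies):
    # More than half of the shard copies (primary + replicas) must be
    # active for the index operation to proceed: int(n / 2) + 1.
    return total_copies // 2 + 1

# One primary plus two replicas = three copies, so two must be active:
print(write_quorum(3))  # -> 2
```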

  2. How do I perform a backup and restore of a whole index?

You need to disable flush using the Indices Update Settings API.

Then copy what's in your data directory, then enable flush again.
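A rough sketch of that disable/copy/enable sequence. The `localhost:9200` endpoint and the index name are assumptions, and the `translog.disable_flush` setting name follows the 0.x-era docs:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed local node

def flush_settings(disabled):
    # Payload for the Indices Update Settings API that turns
    # translog flushing off (True) or back on (False).
    return {"index": {"translog.disable_flush": disabled}}

def update_settings(index, payload):
    # PUT the settings document to the index's _settings endpoint.
    req = urllib.request.Request(
        "%s/%s/_settings" % (ES, index),
        data=json.dumps(payload).encode(),
        method="PUT")
    return urllib.request.urlopen(req)

# Backup sequence (not run here; needs a live cluster):
#   update_settings("myindex", flush_settings(True))
#   ... copy each node's data directory ...
#   update_settings("myindex", flush_settings(False))
```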

To restore, you'd have to stop the same ES node, restore what you backed up
and restart the node. That might get tricky on a multiple-node cluster,
because you have to make sure that the data you restore doesn't conflict
with what is already in the cluster metadata.

If you want some code examples to begin with, here's something I used for
daily indices containing logs:

  3. How reliable would ES be when used for this purpose?

You can use replicas to make sure your cluster survives even if some
individual nodes go down. The nice part about this is that you can change
the number of replicas on a live cluster. And, searches run on replicas as
well. So if you add some more nodes and replicas, your system should handle
more concurrent queries, besides being more fault-tolerant.

If you need extra reliability, I think you should have a safety net of
either backing up your indices regularly, or keeping the source in another
store as well, so you can reindex it.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Hi,

As Radu pointed out already, there are ways to achieve what you want;
we actually came up with a similar idea a while ago and used it successfully
(storing several terabytes of data). Elasticsearch provides a lot of
features to make your store reliable, so I wouldn't worry here (we actually
never lost data in years due to problems in Elasticsearch). Nevertheless,
to share our experience: we moved away from this, because when you just
want a key-value store, Elasticsearch is, IMHO, a bit of an overkill. You
ask what happens when a document is indexed, but do you want to index
documents or just store them? Indexing, with all steps included, takes
time; just storing a document is much faster (we tried MongoDB, and insert
speed was up to 10 times faster). You also have an overhead in file size
due to the index structure (look at all the Lucene files), which might
turn out to be a problem when you are limited, e.g. on SSD drives.
One point concerning fetching would be to look at a routing strategy, so
you don't have to talk to the whole index but just to one shard; that for
sure improves performance.

When your scenario fits, you can make it work in a reliable way, no
doubt here. But if you have special constraints on performance or number of
documents, I would think about what information you actually want to
search on and what you just want to store; it might be more effective to
separate the two. Stores are good for storing, search solutions for
searching :wink:

Greets
Andrej

--

On Wednesday, December 5, 2012 12:28:25 PM UTC, Radu Gheorghe wrote:

Hello,

Initially it looks really interesting - I like the easy way of adding nodes,
Thrift, the realtime KV API and other features. However, there are a few
topics where the documentation is not clear to me, so I have a few questions:

  1. What happens when a new document is indexed? Can I be sure that it
    will be stored on disk even if a node goes down a millisecond later?

Yes. You'll get an OK for your indexing operation once it has been stored
in your transaction log, which is periodically flushed to your index
(according to your settings):
http://www.elasticsearch.org/guide/reference/index-modules/translog.html

That's clear now - thanks.

  2. How do I perform a backup and restore of a whole index?

You need to disable flush using the Indices Update Settings API:

http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html

Then copy what's in your data directory, then enable flush again.

But this only copies data from a single node. My intention is to split it
across several nodes. Is there any other way, or do I need to back up all
nodes separately?

  3. How reliable would ES be when used for this purpose?

You can use replicas to make sure your cluster survives even if some
individual nodes go down. The nice part about this is that you can change
the number of replicas on a live cluster. And, searches run on replicas as
well. So if you add some more nodes and replicas, your system should handle
more concurrent queries, besides being more fault-tolerant.

If you need extra reliability, I think you should have a safety net of
either backing up your indices regularly, or keeping the source in another
store as well, so you can reindex it.

My plan is to use it as the only storage and do regular backups.

Thanks for your help.


--

On Wednesday, December 5, 2012 12:55:06 PM UTC, Andrej Rosenheinrich wrote:

Hi,

As Radu pointed out already, there are ways to achieve what you want;
we actually came up with a similar idea a while ago and used it successfully
(storing several terabytes of data). Elasticsearch provides a lot of
features to make your store reliable, so I wouldn't worry here (we actually
never lost data in years due to problems in Elasticsearch). Nevertheless,
to share our experience: we moved away from this, because when you just
want a key-value store, Elasticsearch is, IMHO, a bit of an overkill. You
ask what happens when a document is indexed, but do you want to index
documents or just store them?

Just store, as pure k/v storage.

Indexing, with all steps included, takes time; just storing a document is
much faster (we tried MongoDB, and insert speed was up to 10 times faster).
You also have an overhead in file size due to the index structure (look at
all the Lucene files), which might turn out to be a problem when you are
limited, e.g. on SSD drives.

Can you elaborate on that? I am not familiar with Lucene. What overhead can
I expect? How did it compare with Mongo in your case?

One point concerning fetching would be to look at a routing strategy, so
you don't have to talk to the whole index but just to one shard; that for
sure improves performance.

Is there anything I need to do? The documentation mentions that there is
default routing based on a hash of the _id field, so it seems I don't need
to worry about that.
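To illustrate how default routing pins each document to one shard (djb2 here is only a stand-in; Elasticsearch uses its own internal hash function):

```python
def djb2(s):
    # A djb2-style string hash, standing in for Elasticsearch's
    # internal routing hash, which is different - this is a sketch.
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

def shard_for(routing_value, num_shards):
    # By default the routing value is the document _id, so every _id
    # maps deterministically to exactly one shard of the index.
    return djb2(routing_value) % num_shards

# A GET by _id therefore only has to ask one shard:
print(shard_for("user-42", 5))
```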

When your scenario fits, you can make it work in a reliable way, no
doubt here. But if you have special constraints on performance or number of
documents, I would think about what information you actually want to
search on and what you just want to store; it might be more effective to
separate the two. Stores are good for storing, search solutions for
searching :wink:

Yes, I keep hearing that. However, I am going to use ES for searching, and
if I could use it for k/v as well, that would greatly simplify my
environment - unless of course ES is completely unsuitable for this
purpose, or there is something better. For now I'd choose it over Mongo
for easier maintenance, and over Riak for its richer set of operations.
I'll run benchmarks to be sure it does what I need, but it seems to meet
my requirements.

--

Hello,

On Wed, Dec 5, 2012 at 3:00 PM, Maciej Dziardziel fiedzia@gmail.com wrote:

On Wednesday, December 5, 2012 12:28:25 PM UTC, Radu Gheorghe wrote:

Hello,

Initially it looks really interesting - I like the easy way of adding nodes,
Thrift, the realtime KV API and other features. However, there are a few
topics where the documentation is not clear to me, so I have a few questions:

  1. What happens when a new document is indexed? Can I be sure that it
    will be stored on disk even if a node goes down a millisecond later?

Yes. You'll get an OK for your indexing operation once it has been stored
in your transaction log, which is periodically flushed to your index
(according to your settings):
http://www.elasticsearch.org/guide/reference/index-modules/translog.html

That's clear now - thanks.

  2. How do I perform a backup and restore of a whole index?

You need to disable flush using the Indices Update Settings API:
http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html

Then copy what's in your data directory, then enable flush again.

But this only copies data from a single node. My intention is to split it
across several nodes. Is there any other way, or do I need to back up all
nodes separately?

You can either back up all your nodes separately, or you can do a scroll[0]
through large portions of your data (maybe complete indices) and get the
source stored or indexed somewhere else as well, so you can reindex it back
if you need to.

Backing up your data directory will back up your whole index (data +
index), as opposed to only your initial data.

[0] the Elasticsearch guide page on the scroll API
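A minimal sketch of such a scroll-based dump. The `localhost:9200` endpoint, index name, and line-per-document output format are assumptions; `search_type=scan` is the 0.x-era way to scroll over an index cheaply:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed local node

def scroll_query(size=500):
    # Match everything; pull `size` docs per shard per round trip.
    return {"query": {"match_all": {}}, "size": size}

def dump_index(index, out_path):
    # Start a scan-type scroll and append every _source as one JSON
    # line, so the data can be reindexed from the file later.
    req = urllib.request.Request(
        "%s/%s/_search?search_type=scan&scroll=5m" % (ES, index),
        data=json.dumps(scroll_query()).encode())
    resp = json.load(urllib.request.urlopen(req))
    scroll_id = resp["_scroll_id"]
    with open(out_path, "w") as out:
        while True:
            req = urllib.request.Request(
                "%s/_search/scroll?scroll=5m" % ES,
                data=scroll_id.encode())
            page = json.load(urllib.request.urlopen(req))
            hits = page["hits"]["hits"]
            if not hits:
                break
            scroll_id = page["_scroll_id"]
            for hit in hits:
                out.write(json.dumps(hit["_source"]) + "\n")
```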

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

I'm not familiar with your key/value data structure, and so much depends on
the data. But here are some general remarks.

If you only store data and never index and never search with queries, you
don't really use Elasticsearch to its full extent.

Elasticsearch is built upon Lucene, which is an inverted index, not a
key/value store. Right now, each document gets indexed / stored again
if just a single field value in the document changes, and a single value
change triggers writes to whole Lucene index files. This is a large
I/O overhead compared to writes into a key/value store. With Lucene 4,
Elasticsearch will be able to offer DocValues, which will improve the
situation a lot. Note that MongoDB offers update-in-place, but the
downside is that it comes with massive space overhead.

Also be aware of the features you may want from Elasticsearch - for
example storing blobs, if the values in your key/value store are large
and opaque. Elasticsearch speaks JSON over HTTP REST, so you have to
serialize each blob to plain text. There is no method yet of storing
binary objects unencoded, only by using base64 (and optionally compressed).
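The base64 overhead is easy to quantify: every 3 bytes become 4 characters, so blobs grow by roughly a third before compression. For example:

```python
import base64

blob = bytes(range(256)) * 4      # 1024 bytes of opaque binary data
encoded = base64.b64encode(blob)  # what would go inside the JSON doc

# 1024 bytes -> 1368 characters, the expected ~4/3 expansion:
print(len(blob), len(encoded))  # -> 1024 1368
assert base64.b64decode(encoded) == blob  # lossless round trip
```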

Jörg

--

If you choose to use ES only as a key/value store, you can consider
setting index to no in your mappings. This way the fields in your
JSON document source don't get indexed into the inverted index, which
lowers the mentioned I/O overhead.
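As a sketch, such a mapping (0.x-era syntax; the type name "doc" and field name "value" are just examples) could look like:

```python
import json

# The "value" field stays retrievable from _source but is not added
# to the inverted index, so Lucene does less work per document:
mapping = {
    "doc": {
        "properties": {
            "value": {"type": "string", "index": "no"}
        }
    }
}
print(json.dumps(mapping, indent=2))
```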

On 5 December 2012 19:32, Jörg Prante joergprante@gmail.com wrote:

I'm not familiar with your key/value data structure, and so much depends on
the data. But here are some general remarks.

If you only store data and never index and never search with queries, you
don't really use Elasticsearch to its full extent.

Elasticsearch is built upon Lucene, which is an inverted index, not a
key/value store. Right now, each document gets indexed / stored again if
just a single field value in the document changes, and a single value
change triggers writes to whole Lucene index files. This is a large I/O
overhead compared to writes into a key/value store. With Lucene 4,
Elasticsearch will be able to offer DocValues, which will improve the
situation a lot. Note that MongoDB offers update-in-place, but the
downside is that it comes with massive space overhead.

Also be aware of the features you may want from Elasticsearch - for
example storing blobs, if the values in your key/value store are large and
opaque. Elasticsearch speaks JSON over HTTP REST, so you have to serialize
each blob to plain text. There is no method yet of storing binary objects
unencoded, only by using base64 (and optionally compressed).

Jörg

--

--
Met vriendelijke groet,

Martijn van Groningen

--

On Wednesday, December 5, 2012 6:32:58 PM UTC, Jörg Prante wrote:

Also be aware of the features you may want from Elasticsearch - for
example storing blobs, if the values in your key/value store are large
and opaque. Elasticsearch speaks JSON over HTTP REST, so you have to
serialize each blob to plain text. There is no method yet of storing
binary objects unencoded, only by using base64 (and optionally compressed).

That's OK for me, I'll only store text anyway. As for the protocol and
serialization, ES supports Thrift, so this is not a problem.

Jörg

--

On Thursday, December 6, 2012 5:32:36 PM UTC, Martijn v Groningen wrote:

If you choose to use ES only as a key/value store, you can consider
setting index to no in your mappings. This way the fields in your
JSON document source don't get indexed into the inverted index, which
lowers the mentioned I/O overhead.

That makes sense, thanks.

--