Evaluating ES and need some input


(Beau Keogh) #1

Hi all, I'm evaluating ES for our project and am wondering whether it's a
good idea to use it for both storage and indexing, or whether we should
store the documents elsewhere and use it just for indexing. We're going to
be storing and indexing data that clients commit, so it's absolutely
imperative that we don't lose anything if the ES index goes south.

Should we store the documents in, say, S3 and index in ES, or is there a
100% safe way to store documents in ES? Or should I say, a 100% reliable
way to recover if/when something goes wrong...

If an index does get corrupted, which scenario will offer the best recovery
options?

Also, we're thinking about having one index for each client. Any notable
pros/cons to setting things up that way? Or should we do one index and
reference documents by clientid?

Thanks!


(Igor Motov) #2

You might find this helpful:
http://stackoverflow.com/questions/6636508/elasticsearch-as-a-database

On Tuesday, April 10, 2012 4:37:09 PM UTC-4, Beau Keogh wrote:



(Shay Banon) #3

My recommendation is to also have an option to reindex the data when using
elasticsearch, at least for the time being. The aim is definitely to
eventually be a stable storage layer, but I recommend either doing backups
or storing the data elsewhere for now.
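
To make the "store elsewhere, keep the option to reindex" idea concrete,
here is a minimal Python sketch. Everything in it is hypothetical:
DocumentStore is an in-memory stand-in for a durable store such as S3, and
SearchIndex is a toy inverted index standing in for the search side. No real
elasticsearch or S3 calls are made; the point is only the write order and the
rebuild path.

```python
from collections import defaultdict


class DocumentStore:
    """Hypothetical stand-in for a durable store (for example S3)."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, body):
        self._docs[doc_id] = body

    def items(self):
        return list(self._docs.items())


class SearchIndex:
    """Toy inverted index standing in for the search index (not elasticsearch)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, body):
        for term in body.lower().split():
            self._postings[term].add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))


def save(store, index, doc_id, body):
    # Write to the durable store first, so the index can always be rebuilt.
    store.put(doc_id, body)
    index.index(doc_id, body)


def rebuild_index(store):
    # Recovery path: throw the old index away and reindex from the store.
    fresh = SearchIndex()
    for doc_id, body in store.items():
        fresh.index(doc_id, body)
    return fresh
```

With this shape, losing or corrupting the index costs a reindex rather than
the data, since the store is treated as the source of truth.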

Regarding using an index per user, here is a good thread:
https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/data$20flow/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ
.
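
For illustration, the two layouts from the question could be sketched as
below. This is a rough sketch under assumptions: the "client-" index name
prefix, the "documents" index name, and both helper functions are made up,
and the bool/filter/term query shape follows recent elasticsearch versions,
so check it against the version you run. The clientid field name comes from
the question.

```python
def target_per_client(client_id, query):
    """Layout 1: one index per client; isolation comes from the index name."""
    # "client-" is a hypothetical naming scheme. Index names are lowercase.
    return f"client-{client_id}".lower(), {"query": query}


def target_shared(client_id, query, index_name="documents"):
    """Layout 2: one shared index; every search is filtered on clientid."""
    body = {
        "query": {
            "bool": {
                "must": [query],
                # Without this filter, one client could see another's documents.
                "filter": [{"term": {"clientid": client_id}}],
            }
        }
    }
    return index_name, body
```

The per-client layout makes it easy to drop or rebuild one client's data, at
the cost of many indices; the shared layout keeps the index count small but
relies on every query carrying the clientid filter.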

On Tue, Apr 10, 2012 at 11:37 PM, Beau Keogh beaukeogh@gmail.com wrote:



(Beau Keogh) #4

Thanks, that's very helpful.

So the post says: "A "user" based data flow, in theory, is perfect for an
index per user case. If you have enough nodes (each shard is a Lucene
index, which has a cost) in the cluster to do that, thats great, and
several very large scale ES users actually do that."

What counts as very large scale? How many nodes would you recommend to
support 100, 500, 1000, or 5000 clients using an "index per client" setup,
where index size could range from 0 to 15,000,000 documents that are
typically 10-100 KB each?


(Shay Banon) #5

It's really hard to tell based on the numbers you gave, since it also
depends on what type of queries you execute, whether you use faceting /
sorting, and what type of machines the nodes are running on.
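
The figures in the question do at least bound the raw document volume, which
can be worked out as below. This is back-of-envelope arithmetic only: it
assumes 1 KB = 1000 bytes and that every client sits at the stated maximum,
and it ignores index overhead, replicas, and compression, so it says nothing
about node count by itself. The function name is made up for the example.

```python
KB = 1000  # assumption: decimal kilobytes


def raw_bytes(clients, docs_per_client, doc_kb):
    """Raw document volume in bytes, before any index overhead or replicas."""
    return clients * docs_per_client * doc_kb * KB


# One client at the stated maximum of 15,000,000 documents:
low = raw_bytes(1, 15_000_000, 10)     # 150 GB at 10 KB per document
high = raw_bytes(1, 15_000_000, 100)   # 1.5 TB at 100 KB per document

# Worst case for the largest scenario, all 5000 clients at the maximum:
worst = raw_bytes(5000, 15_000_000, 100)  # 7.5 PB
```

Since index size ranges from 0 up to that maximum, the real total depends
heavily on how client sizes are distributed.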

On Thu, Apr 12, 2012 at 12:38 AM, Beau Keogh beaukeogh@gmail.com wrote:


