Hi all, I'm evaluating ES for our project and am wondering if it's a good
idea to use it for storage and indexing, or should we store the documents
elsewhere and use it just for indexing. We're going to be storing and
indexing data that clients commit so it's absolutely imperative that we
don't lose anything if the ES index goes south.
Should we store the documents in, say, S3 and index them in ES, or is there a
100% safe way to store documents in ES? Or should I say, a 100% reliable way
to recover if/when something goes wrong...
If an index does get corrupted, which scenario will offer the best recovery
options?
Also, we're thinking about having one index for each client. Any notable
pros/cons to setting things up that way? Or should we use a single index and
reference documents by clientid?
In reply to Beau Keogh's post of Tuesday, April 10, 2012 4:37:09 PM UTC-4:
My recommendation is to also have an option to reindex the data when using
Elasticsearch, at least for the time being. The aim is definitely for it to
eventually be a stable storage layer, but for now I recommend either doing
backups or storing the data elsewhere.
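
As a rough illustration of the "store elsewhere, reindex when needed" approach, here is a minimal sketch that rebuilds a bulk indexing request from documents held in a primary store. The `primary_store` dict, the `build_bulk_body` helper, and the `client-acme` index name are hypothetical stand-ins, not part of any ES client; the only thing taken from Elasticsearch itself is the newline-delimited action/source layout of the bulk API. Older releases may also expect a `_type` field in the action line, so treat the exact fields as an assumption to check against your version.

```python
import json


def build_bulk_body(documents, index_name):
    """Return an NDJSON bulk-request body that (re)indexes every document.

    `documents` maps a document id to its JSON-serializable source, as read
    back from the primary store (S3, a database, flat files, ...).
    """
    lines = []
    for doc_id, source in sorted(documents.items()):
        # Action line: tells the bulk endpoint which index and id to write.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document exactly as it is stored elsewhere.
        lines.append(json.dumps(source))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


# Hypothetical stand-in for documents fetched from the primary store.
primary_store = {
    "doc-1": {"clientid": "acme", "title": "First commit"},
    "doc-2": {"clientid": "acme", "title": "Second commit"},
}

bulk_body = build_bulk_body(primary_store, "client-acme")
print(bulk_body, end="")
```

The resulting string would be sent as the body of a bulk request. Because the primary store stays the source of truth, a corrupted index can be dropped and rebuilt from scratch this way.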
So the post says: "A "user" based data flow, in theory, is perfect for an
index per user case. If you have enough nodes (each shard is a Lucene
index, which has a cost) in the cluster to do that, thats great, and
several very large scale ES users actually do that."
What counts as very large scale? How many nodes would you recommend in order
to support 100, 500, 1000, or 5000 clients using an "index per client" setup,
where index size could range from 0 to 15,000,000 documents, which are
typically 10-100 KB in size?
It's really hard to tell based on the numbers you gave, since it also
depends on what type of queries you execute, whether you use faceting /
sorting, and what type of machines the nodes are running on.
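
The numbers in the question don't yield a node count, but they do give upper bounds on shard count and raw data volume. The sketch below works those out under loudly stated assumptions: every client sits at the 15,000,000-document maximum, and each index uses 5 primary shards with 1 replica (the defaults in older releases, used here only as illustrative parameters). The `estimate` function is hypothetical and is not a sizing recommendation.

```python
def estimate(clients, max_docs=15_000_000, doc_kb=(10, 100),
             shards_per_index=5, replicas=1):
    """Worst-case bounds for an index-per-client layout.

    All parameters are assumptions; adjust them to the real settings.
    """
    # Each shard (primary or replica) is a separate Lucene index.
    total_shards = clients * shards_per_index * (1 + replicas)
    # Raw primary data only, in decimal terabytes (1 TB = 1e9 KB).
    low_tb = clients * max_docs * doc_kb[0] / 1e9
    high_tb = clients * max_docs * doc_kb[1] / 1e9
    return {"total_shards": total_shards,
            "primary_tb_low": low_tb,
            "primary_tb_high": high_tb}


for n in (100, 500, 1000, 5000):
    e = estimate(n)
    print(f"{n:>5} clients: {e['total_shards']:>6} shards, "
          f"{e['primary_tb_low']:.0f}-{e['primary_tb_high']:.0f} TB primary data")
```

Real usage will sit well below these bounds, since index sizes range down to zero, but the shard totals show why the per-shard cost matters with many small indices.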