We've got about the same project: data in CouchDB, and a Java batch that fetches
changes from CouchDB using the _changes API and sends each CouchDB doc to ES.
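Roughly, the batch does something like the sketch below. This is a simplified illustration, not our actual code; it assumes CouchDB on localhost:5984, ES on localhost:9200, a database and index both named mydb, and Jackson on the classpath for JSON parsing.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: one polling pass over CouchDB's _changes feed, pushing
// each changed document into ES. Hosts, ports, and the mydb names are
// assumptions, not a description of our real batch.
public class ChangesToEs {
    static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        String since = args.length > 0 ? args[0] : "0";
        // include_docs=true embeds the full document in every change entry.
        URL changes = new URL(
            "http://localhost:5984/mydb/_changes?include_docs=true&since=" + since);
        JsonNode body = MAPPER.readTree(changes.openStream());

        for (JsonNode change : body.get("results")) {
            String id = change.get("id").asText();
            if (change.path("deleted").asBoolean(false)) {
                // Deletions show up in the feed too, so the ES index can follow.
                send("DELETE", "http://localhost:9200/mydb/doc/" + id, null);
            } else {
                send("PUT", "http://localhost:9200/mydb/doc/" + id,
                     MAPPER.writeValueAsBytes(change.get("doc")));
            }
        }
        // Store last_seq somewhere so the next run resumes where this one stopped.
        System.out.println("last_seq=" + body.get("last_seq").asText());
    }

    static void send(String method, String url, byte[] payload) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        if (payload != null) {
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload);
            }
        }
        conn.getResponseCode(); // fire the request; a real batch would check this
        conn.disconnect();
    }
}
```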
We don't use the CouchDB river, but I recommend using it to start evaluating
ES, as it's really easy to set up.
The river handles add/update/delete, so it will be very easy for you.
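Registering the river is a single REST call, along these lines (a sketch only; it assumes the couchdb-river plugin is installed and reuses the mydb names from above):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: tell ES to follow mydb's _changes feed by itself via the CouchDB river.
// Assumes the couchdb-river plugin is installed and ES runs on localhost:9200.
public class RegisterRiver {
    public static void main(String[] args) throws Exception {
        String meta = "{"
            + "  \"type\": \"couchdb\","
            + "  \"couchdb\": { \"host\": \"localhost\", \"port\": 5984, \"db\": \"mydb\" },"
            + "  \"index\": { \"index\": \"mydb\", \"type\": \"mydb\" }"
            + "}";
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/_river/mydb/_meta").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(meta.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

Once the _meta document is stored, the river starts pulling changes on its own; deleting the river (DELETE on /_river/mydb) stops it.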
What I can suggest is to test it and form your own opinion of ES. I'm pretty
sure you're going to love it.
So build a "small platform" (2 GB RAM and less than 100 GB of disk space) and
go for it.
As I told you before, you can run ES on a laptop.
ES uses RAM to store its indexes only if you ask for it (see the index store
settings in the Elasticsearch documentation). By default, ES uses the local
file system to store the Lucene indexes.
ES uses a lot of RAM if you are doing faceting or sorting. So for test
purposes, you can start with 2 GB of RAM.
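To make the first point concrete, here is a sketch of creating an index with an explicit in-memory store. It relies on the 0.x index.store.type setting; the index name is just an example, and omitting the setting keeps the default file-system store:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: opt in to a RAM-backed Lucene store at index-creation time.
// Assumes ES on localhost:9200; without "index.store.type" the index
// lives on the local file system, which is the default.
public class CreateMemoryIndex {
    public static void main(String[] args) throws Exception {
        String settings = "{ \"settings\": { \"index.store.type\": \"memory\" } }";
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/ram_index").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(settings.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```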
But you will be more comfortable going to production if you have more than
one node, with fast disks (SSDs) and lots of memory.
HTH
David
-----Original Message-----
From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com]
On Behalf Of yojimbo87
Sent: Tuesday, 22 November 2011 18:32
To: elasticsearch
Subject: Re: Elasticsearch with CouchDB and memory consumption
Thanks David for having patience with me.
Let's say I'm in this situation:
- CouchDB is responsible for adding/updating/deleting data and keeping it durable
- my dataset in CouchDB takes about 10 GB of disk space
- the server has 2 GB of RAM
- CouchDB doesn't support dynamic ad-hoc querying, and map/reduce doesn't suit my needs
- I need to be able to search/query my entire dataset dynamically for documents based on their field values (that's why I would like to evaluate ES for this functionality)
- I need ES only for search functionality over the dataset documents; add/edit/delete would be taken care of by CouchDB
My concern is:
I understand that ES needs to index the entire dataset from CouchDB
before I can start searching/querying the data, but if my CouchDB
dataset takes 10 GB of disk space, wouldn't ES need ~10 GB of RAM to
index these documents (assuming that I don't want to ignore any
fields)? To be clearer, I would like to know how ES indexes data:
whether it stores it only in RAM for fast access, or also on disk (in
case the dataset can't fit into RAM). I guess the latter is how ES
works, so I would then have 10 GB of data in CouchDB and ~10 GB of
data indexed by ES (some of it in RAM and most of it on disk). Sorry
if I'm being annoying with my concern, but I would like to make things
clear in my head.
On Nov 22, 5:23 pm, "da...@pilato.fr" da...@pilato.fr wrote:
Just want to add something:
ES will not search directly within your dataset.
You will have to index all of your data in ES (manually, with the CouchDB
river, ...).
So, once your data is indexed, you will be able to search it even if you
shut down CouchDB.
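For instance, once a document is in ES, a search goes only to the ES index; CouchDB is never consulted. A small sketch (assuming ES on localhost:9200 and illustrative mydb/doc names; refresh=true is only there so the test doc is immediately searchable):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: index one document into ES, then search it back - no CouchDB involved.
public class IndexThenSearch {
    public static void main(String[] args) throws Exception {
        // Index a document under id 1; refresh=true makes it searchable at once.
        HttpURLConnection put = (HttpURLConnection)
            new URL("http://localhost:9200/mydb/doc/1?refresh=true").openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        try (OutputStream out = put.getOutputStream()) {
            out.write("{ \"user\": \"yojimbo87\", \"topic\": \"memory\" }"
                      .getBytes(StandardCharsets.UTF_8));
        }
        put.getResponseCode();

        // Query-string search against the ES index only.
        URL search = new URL("http://localhost:9200/mydb/_search?q=topic:memory");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(search.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```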
Not sure that's what you imagine by having "search/ad-hoc query
functionality on your dataset".
David
On 22 November 2011 at 17:03, yojimbo87 bosak.to...@gmail.com wrote:
So if I understand it correctly, I can have CouchDB durably persist my
data on disk, and the size of this dataset can be greater (to some
extent or limit, of course) than the amount of RAM used by ES to
provide search/ad-hoc query functionality on my dataset. What I need
is ad-hoc querying for my CouchDB dataset, but I was worried about
what would happen if the dataset stored on disk in CouchDB were
greater than the amount of RAM that can be assigned to ES for managing
search/query functionality on top of it.
On Nov 22, 1:52 pm, Shay Banon kim...@gmail.com wrote:
Lucene and Elasticsearch require a certain amount of memory to operate. It
starts with Lucene holding parts of the inverted index in memory to improve
search performance (this can be controlled), and continues with
Elasticsearch for things like faceting on fields. If there isn't enough
memory, you will usually get a failure logged (OutOfMemoryError) and you
need to make sure to allocate more memory. The nodes info and nodes stats
APIs give statistics regarding memory usage and boundaries.
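For example, a quick way to watch heap usage is to poll the stats endpoint (a sketch; the path below is the current one, while older 0.x releases expose it as /_cluster/nodes/stats):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: dump node statistics (including JVM heap usage) to stdout.
// Assumes ES on localhost:9200.
public class NodeStats {
    public static void main(String[] args) throws Exception {
        URL stats = new URL("http://localhost:9200/_nodes/stats?pretty=true");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stats.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```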
On Mon, Nov 21, 2011 at 11:13 PM, yojimbo87 bosak.to...@gmail.com
wrote:
By memory I meant RAM: the data fits on disk, but not into RAM. For
example, my CouchDB dataset is 10 GB, but I have only 2 GB of RAM. How
does ES deal with this situation, when only ~1/5 of the original
dataset can fit into RAM?
On Nov 21, 5:01 pm, "da...@pilato.fr" da...@pilato.fr wrote:
I don't know if you are talking about the individual size of each document
you get from CouchDB or the global size of your ES index.
You are talking about memory. Do you mean disk space?
Let me say that I have never seen ES have problems managing individual
documents, even large ones (more than 1000 elements in an array, with more
than a hundred fields each).
That said, I ran out of disk space in my production cluster last week and
ES handled it very well:
- sending information back to the client that the document had not been
indexed
- letting users perform searches without any problem
Not sure I answered your concerns...
Cheers
David.
On 21 November 2011 at 16:25, yojimbo87 bosak.to...@gmail.com wrote:
Thanks David, this answered my second question; however, I would also
like to know what happens in case my dataset doesn't fit into memory.
Is it still possible to use ES functionality when there is not enough
RAM to hold and index all the documents?
--
David Pilato
http://dev.david.pilato.fr/
Twitter: @dadoonet