1. *Can I use Elasticsearch as a data store?* I keep hearing that I can. I
understand the classic issues compared to a database, such as the lack of
transactions, but those are not a concern for me. I would prefer
to keep the document and the index together, rather than storing the data in
Cassandra and the index in Elasticsearch. Some people say that keeping
data and index together is an anti-pattern because it increases the size
of the Lucene index, so you cannot scale the two separately. What do you
think?
2. Assuming I can store data in Elasticsearch, does the data get stored
with the index, meaning my index will be larger than it would be
without the stored data? If so, I could store only the index
in Elasticsearch and go to Cassandra or some other NoSQL database to get
the actual doc. Is storing data in the index an anti-pattern?
3. I have five dedicated machines (128 GB, 16 cores, 8x500 GB each). I would like
to index 200 million documents with a total size of 10 TB. In general, how
many documents can I index? What is the best you have seen? There are
about 50 fields, and one field is the concatenation of 10 web pages
from a website. Are we talking 100, 200, 1,000 or 10,000 docs per second
per machine?
4. Related to question 3: currently we build the index in Hadoop and bring it
into Solr (which is what we use today). It looks like we cannot do
something like that in Elasticsearch. Is that right? In other words, do you
have to use the API to index?
Yes, you can. A couple of years ago, I would have said no. That said,
I don't use Elasticsearch as my primary data store, since any loss of
data is unacceptable for me and I want to be 100% sure.
Keeping the data co-located with the index likely makes things faster.
Maybe with Lucene it was an anti-pattern, but with the distributed nature
of Elasticsearch I would no longer consider that the case.
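On the question of whether the document is stored alongside the index: by default Elasticsearch keeps the original JSON in the `_source` field, and a mapping can switch that off. The sketch below only builds the create-index request body. The field names `title` and `pages` are hypothetical, and the typeless mapping layout and the `text` field type assume a recent Elasticsearch version rather than the 2013-era release discussed in this thread.

```python
import json


def build_index_body(store_source: bool) -> dict:
    """Build a create-index request body with _source switched on or off.

    The field names below are hypothetical placeholders for illustration.
    """
    return {
        "mappings": {
            # When False, the original JSON document is not kept in the index.
            "_source": {"enabled": store_source},
            "properties": {
                "title": {"type": "text"},
                "pages": {"type": "text"},
            },
        }
    }


if __name__ == "__main__":
    # Index-only variant: documents would have to be fetched from another store.
    print(json.dumps(build_index_body(store_source=False), indent=2))
```

With `_source` disabled, a search returns IDs that would then be looked up in Cassandra or a similar store. The trade-off is that features relying on `_source`, such as the update API and reindexing, are no longer available, so it is worth measuring the actual size difference before going that way.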
It all depends on the data, the index setup and the hardware. Using the Java node
client and the bulk API will help cut down on overhead. You'd really need
to run tests to confirm, but I don't think 1,000 docs/sec is unreasonable.
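To make the bulk suggestion concrete, here is a minimal sketch of the request body that the HTTP `_bulk` endpoint expects: one action line followed by one document line, each newline-terminated. It is written in Python against the REST format rather than the Java node client mentioned above. The index name `websites`, the `id` key and the sample documents are assumptions for illustration, and older Elasticsearch versions also expected a `_type` in the action line.

```python
import json
from typing import Iterable, Iterator


def bulk_lines(index: str, docs: Iterable[dict]) -> Iterator[str]:
    """Yield alternating action and source lines for the _bulk endpoint."""
    for doc in docs:
        source = dict(doc)            # copy so the caller's dict is untouched
        doc_id = source.pop("id")     # hypothetical: each doc carries its own id
        yield json.dumps({"index": {"_index": index, "_id": doc_id}})
        yield json.dumps(source)


def bulk_body(index: str, docs: Iterable[dict]) -> str:
    """Join the lines into an NDJSON body; every line ends with a newline."""
    return "".join(line + "\n" for line in bulk_lines(index, docs))


if __name__ == "__main__":
    sample = [
        {"id": "1", "title": "first site"},
        {"id": "2", "title": "second site"},
    ]
    print(bulk_body("websites", sample), end="")
```

The body would be sent as a POST to `/_bulk` with content type `application/x-ndjson`. Batching many documents per request is what cuts the per-request overhead; the best batch size is something to find by testing.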
(In reply to the question at the top of this thread, posted by Karthik
Shyamsunder on Friday, February 8, 2013 5:04:16 AM UTC-7.)
The common response is "it depends", because every situation is unique:
doc size, number of docs, analyzer complexity, hardware, query pattern,
and so on. All these variables make it very difficult to predict with any
accuracy what kind of performance to expect.
However, that answer is always terribly unsatisfying, so here are some
personal results I've obtained from benchmarking.
My cluster is three servers, each with 32 GB RAM, 8 cores and a software
RAID1 setup on traditional 7200 rpm disks.
The index had 3 shards and 1 replica.
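For anyone reproducing a layout like this, the shard and replica counts are given in the index settings, normally when the index is created. A minimal sketch of that settings body follows. The helper names are made up for illustration; `number_of_shards` and `number_of_replicas` are the actual setting names.

```python
import json


def index_settings(shards: int, replicas: int) -> dict:
    """Build a create-index body with the given shard and replica counts."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(shards: int, replicas: int) -> int:
    """Each primary shard has `replicas` additional copies."""
    return shards * (1 + replicas)


if __name__ == "__main__":
    print(json.dumps(index_settings(3, 1)))
    print(total_shard_copies(3, 1))  # 3 primaries + 3 replicas = 6
```

With 3 shards and 1 replica there are six shard copies in total, so on three servers that works out to two per server if they are balanced evenly.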
I could reliably hit 10,000-40,000 indexing requests/second using the
non-bulk API. I haven't played with the bulk API yet to see what kind of
performance is possible. Doc size, field count and analyzer complexity
drastically altered indexing speed.
QPS (whatever it was for a given test) remained relatively constant up to
100 million docs, at which point I stopped the tests. Max index size only hit
100 GB though, so you can see that the docs I was dealing with were
still relatively small.
On my cluster, Disk I/O saturated before any other resource.
As soon as you start querying at the same time as indexing, these
benchmarks go out the window. Adding query overhead makes it a very
different matter.
Hope that helps give you at least an idea of what's possible. But
honestly, it really depends and you should run some sample benchmarks
yourself.
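Once a sample benchmark gives a sustained rate, turning it into a total load time is simple arithmetic. The sketch below uses the figures from the original question (200 million docs, 10 TB, five machines) and the candidate rates it asks about. It assumes the rate is sustained for the whole run, that throughput scales linearly across machines, and that 10 TB means 10^13 bytes, none of which is guaranteed in practice.

```python
def hours_to_index(total_docs: int, machines: int, docs_per_sec_per_machine: float) -> float:
    """Hours needed to index everything at a sustained per-machine rate."""
    cluster_rate = machines * docs_per_sec_per_machine   # docs per second overall
    return total_docs / cluster_rate / 3600


def average_doc_bytes(total_bytes: int, total_docs: int) -> float:
    """Mean document size implied by the totals."""
    return total_bytes / total_docs


TOTAL_DOCS = 200_000_000        # from the original question
TOTAL_BYTES = 10 * 10**12       # 10 TB, taken as decimal terabytes (assumption)
MACHINES = 5

if __name__ == "__main__":
    print(f"average doc size: {average_doc_bytes(TOTAL_BYTES, TOTAL_DOCS):,.0f} bytes")
    for rate in (100, 1_000, 10_000):
        hours = hours_to_index(TOTAL_DOCS, MACHINES, rate)
        print(f"{rate:>6} docs/s per machine -> {hours:6.1f} hours")
```

At 1,000 docs/sec per machine the full set would take roughly 11 hours, and the implied average document is about 50 KB, which is far larger than the small docs in the benchmark above, so the measured rates should not be carried over directly.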
-Zach