Questions relating to Elasticsearch

  1. *Can I use Elasticsearch as a data store?* I keep hearing I can. I
    definitely understand the classic issues, like the lack of transactions,
    compared to a database, but that is not a concern for me. I would prefer
    to keep the document and the index together, rather than storing data in
    Cassandra and the index in Elasticsearch. Some people say that keeping
    data and index together is an anti-pattern because that increases the size
    of the Lucene index, so you cannot scale the two separately. What do you
    think?

  2. Assuming I can store data in Elasticsearch, does the data get stored
    with the index, meaning my index will be larger than if I did not store
    the data? If that is the case, I could store only the index in
    Elasticsearch and go to Cassandra or some other NoSQL database to get
    the actual doc. Is storing data in the index an anti-pattern?

  3. I have five (5) dedicated machines, each with 128GB of RAM, 16 cores and
    8x500GB disks. I would like to index 200 million documents with a total
    size of 10TB. In general, how many documents per second can I index? What
    is the best you have seen? There are about 50 fields, and one field is a
    concatenation of 10 webpages for a website. Are we talking 100, 200,
    1000, 10000 ... docs per second per machine?

  4. Related to question 3, currently we index in Hadoop and bring the index
    into Solr (which is what we use today). It looks like we cannot do
    something like that in Elasticsearch. Is that right? In other words, do
    you have to use the API to index?

Please advise.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  1. Yes, you can. A couple of years ago, I would have said no. That being
    said, I don't use elasticsearch as my primary data store since any loss of
    data is not acceptable and I want to be 100% sure.

  2. Keeping the data co-located with the index likely makes things faster.
    Maybe with Lucene it was an anti-pattern, but with the distributed nature
    of elasticsearch I would no longer consider that the case.

  3. It all depends on the data, index setup and hardware. Using the Java
    node client and the bulk APIs will help cut down on overhead. You'd really
    need to run tests to confirm. I don't think 1000 docs/sec is unreasonable.

  4. A lot of datastores have integrations/plugins available. I'm not
    familiar with doing this, but this seems to be an
    option: infochimps-labs/wonderdog on GitHub (bulk loading for Elasticsearch)
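
To illustrate the bulk API mentioned in answer 3, here is a minimal sketch of how a bulk request body is laid out: newline-delimited JSON, with an action line followed by the document itself. It only builds the payload and sends nothing. The index name `websites` and the `url` field are hypothetical, and older Elasticsearch releases also expect a `_type` in the action line, so treat this as an assumption to check against your version.

```python
import json


def build_bulk_body(index_name, docs):
    """Return a newline-delimited JSON payload for the _bulk endpoint.

    docs is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        # Action line: says what to do with the line that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document to index.
        lines.append(json.dumps(source))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


# Hypothetical documents, for illustration only.
body = build_bulk_body(
    "websites",
    [("1", {"url": "http://example.com"}), ("2", {"url": "http://example.org"})],
)
print(body)
```

Sending many documents per request this way, rather than one request per document, is what reduces the per-request overhead.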

Best Regards,
Paul

On Friday, February 8, 2013 5:04:16 AM UTC-7, Karthik Shyamsunder wrote:

  1. *Can I use Elasticsearch as a data store?* I keep hearing I can. I
    definitely understand the classic issues, like the lack of transactions,
    compared to a database, but that is not a concern for me. I would prefer
    to keep the document and the index together, rather than storing data in
    Cassandra and the index in Elasticsearch. Some people say that keeping
    data and index together is an anti-pattern because that increases the size
    of the Lucene index, so you cannot scale the two separately. What do you
    think?

  2. Assuming I can store data in Elasticsearch, does the data get stored
    with the index, meaning my index will be larger than if I did not store
    the data? If that is the case, I could store only the index in
    Elasticsearch and go to Cassandra or some other NoSQL database to get
    the actual doc. Is storing data in the index an anti-pattern?

  3. I have five (5) dedicated machines, each with 128GB of RAM, 16 cores and
    8x500GB disks. I would like to index 200 million documents with a total
    size of 10TB. In general, how many documents per second can I index? What
    is the best you have seen? There are about 50 fields, and one field is a
    concatenation of 10 webpages for a website. Are we talking 100, 200,
    1000, 10000 ... docs per second per machine?

  4. Related to question 3, currently we index in Hadoop and bring the index
    into Solr (which is what we use today). It looks like we cannot do
    something like that in Elasticsearch. Is that right? In other words, do
    you have to use the API to index?

Please advise.


I'll weigh in on #3:

The common response is "it depends", because every situation is unique:
doc size, number of docs, analyzer complexity, hardware, query pattern,
and so on. All the variables make it very difficult to predict with any
accuracy what kind of performance to expect.

However, that answer is always terribly unsatisfying, so here are some
personal results I've obtained from benchmarking.

  • My cluster is 3 servers, each with 32GB RAM, 8 cores and a software
    RAID1 disk setup on traditional 7200 rpm disks.
  • Index was [3 shard, 1 replica].
  • I could reliably hit 10,000-40,000 indexing requests/second using the
    non-bulk API. I haven't played with the bulk API yet to see what kind of
    performance is possible. Doc size, field count and analyzer complexity
    drastically altered indexing speed.
  • Indexing throughput (whatever it was for the test) remained relatively
    constant up to 100 million docs, at which point I stopped the tests. Max
    index size only hit 100GB though, so you can see that the docs I was
    dealing with were still relatively small.
  • On my cluster, Disk I/O saturated before any other resource.
  • As soon as you start querying at the same time as indexing, these
    benchmarks go out the window. Adding query overhead makes it a very
    different matter.

Hope that helps give you at least an idea of what's possible. But
honestly, it really depends and you should run some sample benchmarks
yourself.
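
As a rough back-of-the-envelope sketch for question 3, the figures from the original post (200 million documents, 10TB, five machines) can be turned into an average document size and an estimated wall-clock time for each per-machine rate the question mentions. It assumes a sustained rate spread evenly across machines, decimal terabytes, and no replica or query overhead, none of which is guaranteed in practice.

```python
def indexing_hours(total_docs, docs_per_sec_per_machine, machines):
    """Hours needed to index total_docs at a sustained, evenly spread rate."""
    cluster_rate = docs_per_sec_per_machine * machines
    return total_docs / cluster_rate / 3600


TOTAL_DOCS = 200_000_000
TOTAL_BYTES = 10 * 10**12  # 10TB, taken as decimal terabytes (assumption)
MACHINES = 5

# Average document size in kilobytes.
avg_doc_kb = TOTAL_BYTES / TOTAL_DOCS / 1000
print("average doc size (KB):", avg_doc_kb)  # 50.0

# Candidate per-machine rates taken from the question.
for rate in (100, 1000, 10000):
    hours = indexing_hours(TOTAL_DOCS, rate, MACHINES)
    print(rate, "docs/sec/machine ->", round(hours, 1), "hours")
# 100 -> 111.1 hours, 1000 -> 11.1 hours, 10000 -> 1.1 hours
```

At roughly 50KB per document, these docs are far larger than the ones in the benchmark above, so the measured request rates should not be expected to carry over directly.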

-Zach
