Using ElasticSearch as Primary Data Store


(vaidik) #1

Hi Folks,

I am working on a project where we have the following specifications:

  1. Collect events at the rate of about 50-100 events/second, creating
    a new document for almost every event.
  2. We are going to query and perform multiple terms facets on that data.
  3. We are not going to be performing Full Text Searches.

I have played around with Elasticsearch and it serves the purpose from
the application's point of view. We are able to meet all our application
requirements.
We would like to use just one data store where the entire data resides. And
I know Elasticsearch can be used as a primary data store. But, with this
scale of documents, I wonder if anyone is using Elasticsearch as their
primary data store and not storing data anywhere else at all for purposes
like recovery, checking accuracy, etc.

I came across a case study on Elasticsearch.com which mentions that DataDog
is using Elasticsearch as its only datastore, having moved from Postgres to
ES. But the case study itself doesn't have more specific information on the
challenges, on what one should be ready for when using Elasticsearch as the
single source of truth, or on how to prepare for no data loss at all if
something goes wrong in the ES cluster. What scenarios must one be prepared
for? What else?

Are there any stories around this use-case that might help us? I would be
glad to be pointed in the right direction or to get some advice.

Thanks,

Vaidik Kapoor
vaidikkapoor.info

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CACWtv5nnSxmadfDz3JQ%3De3KH%3D0n6xqN_qPCv1NjB%2BmjmKTnf4Q%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Radu Gheorghe) #2

Hello Vaidik,

I think ES is pretty safe to use as a primary data store: if a node goes
down and you have replicas it will continue to function and so on.

That said, if something goes seriously wrong in your cluster and your data
becomes corrupted across all the replicas or something similarly tragic,
you can go for backups. I guess this is valid for all data stores.

If you need backups, you can look at the snapshot-restore API
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/modules-snapshots.html),
which just came out with 1.0.0 beta 2
(http://www.elasticsearch.org/blog/1-0-0-beta2-released/).
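
As a rough sketch of what that snapshot workflow looks like over HTTP, the
Python below builds the three requests (register a repository, take a
snapshot, restore it) without sending them. The repository name `my_backup`,
the snapshot name `snapshot_1`, the location path, and the host are
placeholder assumptions, and the helper function names are hypothetical; the
endpoints follow the snapshot-restore documentation linked above.

```python
import json

ES_HOST = "http://localhost:9200"  # assumed address of one cluster node


def register_fs_repository(repo, location):
    """Request that registers a shared-filesystem snapshot repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"{ES_HOST}/_snapshot/{repo}", json.dumps(body))


def create_snapshot(repo, snapshot):
    """Request that snapshots the cluster's indices and waits for completion."""
    url = f"{ES_HOST}/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return ("PUT", url, None)


def restore_snapshot(repo, snapshot):
    """Request that restores the indices contained in the snapshot."""
    return ("POST", f"{ES_HOST}/_snapshot/{repo}/{snapshot}/_restore", None)


if __name__ == "__main__":
    # Print the requests in the order they would normally be issued.
    for method, url, body in (
        register_fs_repository("my_backup", "/mount/backups/my_backup"),
        create_snapshot("my_backup", "snapshot_1"),
        restore_snapshot("my_backup", "snapshot_1"),
    ):
        print(method, url, body or "")
```

For a filesystem repository, the location has to be a path that every node
in the cluster can reach, such as a shared mount.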

--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


(vaidik) #3

Fair enough. That introduces me to a couple of things to explore.

I shall shoot more questions on this very thread if and when I have more to
ask/consult/share.

Thanks for your answer. :slight_smile:

Vaidik Kapoor
vaidikkapoor.info



(Eugene Strokin) #4

I have used ES as a primary datasource since version 0.2. It has been in production for almost 2 years. We started with 5 shards all on one node and now run those 5 shards with 1 replica across 3 nodes. It serves about a dozen requests per second on average: all kinds of requests, including searches, filtering, sorting, and faceting. I have even moved the whole cluster to different datacenters several times with zero downtime. All the problems I had were because I did something wrong; they were not the fault of ES.
So I can now say that ES can be used as the only data store.
I've tried several other options: Solr (too hard to scale), Cassandra (not easy to support a complex data structure, and I wouldn't even call ours complex really), and HBase on Hadoop (too low level). The performance of ES is also very impressive compared to the others.



(vaidik) #5

For our use-case, we need facets desperately. Otherwise we will have to do
that in the application logic, which is not ideal and honestly a lot of
work too. ES gives me that. However, with the number of documents we need
to index per second (30-50 per second, and this number is going to grow
with time), I wonder what people do to make sure that:

  • The chance of data loss is minimal. You cannot flush segments to disk
    very frequently, as that won't be optimal, and to my knowledge a lot of
    unoptimized segments will be created if I manually use the Flush API. So
    what does ES do when data has been written to the translog but the
    operations have not been flushed and the node goes down?
  • If there is scope for data loss, what does one do to detect it?
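
One possible way to approach the detection question, sketched under
assumptions: if the consumer records the ID of every event it sends, those
IDs can later be reconciled against the IDs read back from the index. The
function names and the `evt-N` IDs below are hypothetical, and how the two
ID lists are collected is left out.

```python
def find_missing_ids(sent_ids, indexed_ids):
    """Return IDs that were sent for indexing but are absent from the index."""
    return sorted(set(sent_ids) - set(indexed_ids))


def reconcile(sent_ids, indexed_ids):
    """Summarise how many of the sent documents made it into the index."""
    sent = set(sent_ids)
    return {
        "sent": len(sent),
        "indexed": len(sent & set(indexed_ids)),
        "missing": find_missing_ids(sent_ids, indexed_ids),
    }


if __name__ == "__main__":
    sent = ["evt-1", "evt-2", "evt-3", "evt-4"]  # hypothetical IDs logged by the consumer
    indexed = ["evt-1", "evt-2", "evt-4"]        # hypothetical IDs read back from ES
    print(reconcile(sent, indexed))
```

Any ID reported as missing is a candidate for re-sending from whatever
record the consumer kept.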

Being new to ES, I am still trying to understand where ES uses the JVM heap
and how that can affect my cluster. Consider this: we have three nodes and
we are indexing data at a rate of 30-50 docs per second. When I started the
cluster, JVM heap usage was low (about 2-4% on each node). Over time it
keeps growing and stabilizes between 81% and 94%. Meanwhile, I am only
indexing and not querying the cluster at all. I am using G1 GC instead of
CMS because CMS was giving me very long pauses (about 13-17 seconds per
garbage collection), which is not ideal. With G1, collections are more
frequent and quicker (so far I have seen about 1 second). This works, but I
am concerned that heap usage is so high: what will happen if a little more
load comes down the pipeline? Will ES be able to take it, or is there a
chance of OutOfMemory exceptions taking the node down? Obviously, this is
something I will have to test for my own use-case, but I am interested in
knowing whether anyone here has experienced similar problems and found a
solution or a workaround.

After some time, GC happens so frequently that I can tell it is affecting
indexing. I am indexing using a RabbitMQ consumer written in Python, and
every 10-20 seconds I see a peak in the queue. That suggests the consumer
cannot keep up, which in turn suggests it cannot write to ES quickly
enough, leading me to assume that GC is the cause of the slow writes
because the CPU is busy.

So:

  • What are, or could be, the reasons for such heap usage? What is ES
    doing with so much heap?
  • How can I keep it under control?

Thanks

Vaidik Kapoor
vaidikkapoor.info



(Jörg Prante) #6

ES picks up translogs and replays them on the next start after a node goes
down.
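
To illustrate the idea of replaying a transaction log, here is a toy model
in Python. It is not Elasticsearch's actual code: the class `ToyShard` and
its methods are hypothetical, and the model assumes every logged operation
reached the disk before the crash.

```python
class ToyShard:
    """Toy model of a write-ahead log; not Elasticsearch's actual code."""

    def __init__(self):
        self.committed = {}  # stands in for segments flushed to disk
        self.buffer = {}     # in-memory changes, lost on a crash
        self.translog = []   # stands in for the on-disk transaction log

    def index(self, doc_id, doc):
        # Log the operation first, then apply it in memory.
        self.translog.append((doc_id, doc))
        self.buffer[doc_id] = doc

    def flush(self):
        # Commit in-memory changes and start a fresh translog.
        self.committed.update(self.buffer)
        self.buffer.clear()
        self.translog.clear()

    def crash(self):
        # A node failure wipes memory; committed data and the translog survive.
        self.buffer = {}

    def recover(self):
        # On restart, replay every logged operation that was never flushed.
        for doc_id, doc in self.translog:
            self.buffer[doc_id] = doc

    def get(self, doc_id):
        return self.buffer.get(doc_id, self.committed.get(doc_id))


if __name__ == "__main__":
    shard = ToyShard()
    shard.index("1", {"event": "login"})
    shard.flush()
    shard.index("2", {"event": "logout"})  # logged but not yet flushed
    shard.crash()
    shard.recover()
    print(shard.get("1"), shard.get("2"))
```

Operations that had not been synced to disk when the node died would not be
replayable, so the log narrows the window for loss rather than removing it.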

You work heavily with facets, and I share your concerns about OOMs making
the whole cluster flaky.

Have you checked how your cache on the heap is used? Note that some caches
are turned on by default and may interfere with your facets.
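
As a sketch of how one might check this, the Python below summarises heap
and cache usage from a nodes-stats style response that has already been
fetched and parsed. The response layout used here (`jvm.mem`,
`indices.fielddata`, `indices.filter_cache`) is an assumption that varies
between ES versions, the sample numbers are made up, and `summarize_heap`
is a hypothetical helper.

```python
def summarize_heap(nodes_stats):
    """Per-node heap usage and cache sizes from a nodes-stats style dict."""
    summary = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        mem = node["jvm"]["mem"]
        indices = node.get("indices", {})
        used_pct = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        summary[node.get("name", node_id)] = {
            "heap_used_pct": round(used_pct, 1),
            # Caches missing from the response are reported as 0 bytes.
            "fielddata_bytes": indices.get("fielddata", {}).get("memory_size_in_bytes", 0),
            "filter_cache_bytes": indices.get("filter_cache", {}).get("memory_size_in_bytes", 0),
        }
    return summary


if __name__ == "__main__":
    sample = {  # made-up numbers in the assumed response layout
        "nodes": {
            "abc123": {
                "name": "node-1",
                "jvm": {"mem": {"heap_used_in_bytes": 850, "heap_max_in_bytes": 1000}},
                "indices": {"fielddata": {"memory_size_in_bytes": 300}},
            }
        }
    }
    print(summarize_heap(sample))
```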

You should also look into the new aggregation framework in 1.0.0.Beta2 to
see whether the new faceting consumes fewer resources.
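
For comparison, here is a minimal sketch of the same terms breakdown
expressed as a facet and as an aggregation. The field name `event_type`,
the name `by_type`, and the helper functions are hypothetical; only the
request bodies are built, nothing is sent to a cluster.

```python
import json


def terms_facet_request(name, field, size=10):
    """Search body using the older terms facet."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # return only the facet counts, no hits
        "facets": {name: {"terms": {"field": field, "size": size}}},
    }


def terms_aggregation_request(name, field, size=10):
    """Search body using a terms aggregation from the new framework."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # return only the aggregation buckets, no hits
        "aggregations": {name: {"terms": {"field": field, "size": size}}},
    }


if __name__ == "__main__":
    print(json.dumps(terms_facet_request("by_type", "event_type"), indent=2))
    print(json.dumps(terms_aggregation_request("by_type", "event_type"), indent=2))
```

Running both against the same index and watching heap usage would be one
way to compare their resource consumption.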

Regarding the indexing, check if you have a strategy for segment merging
and throttling. Setting custom values can take much pressure off the heap,
especially when segments grow very large.

And finally, check if you can identify the sweet spot when to add nodes, if
heap usage is just getting too high.

Jörg



(davrob) #7

Hi Jorg,

So if my index files get corrupted and I restore (using the new Snapshot
API) the cluster, is there a way of moving the translogs from the old
cluster nodes to the new one?

David



(Matt Weber) #8

For those of you concerned about OOM and heap issues caused by facets
and/or caching, I would consider waiting for 1.0, which will have:

  • A field data circuit breaker to limit how much memory is used.
  • Disk-based fielddata (doc values).

Doc values are in Beta2, and I imagine the circuit breaker will be in the
next release. On top of that, a lot of work has gone into reducing the
number of objects created, so there will be fewer GCs in general.

Thanks,
Matt Weber



(system) #9