Since you are using the s3 gateway, you are safe up to the last checkpoint
that happened (checkpoints happen periodically, but a checkpoint can take time).
Sorry, I missed the heap size allocated to ES; I typically recommend
setting ES_HEAP_SIZE to ~50% of the machine's memory.
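On EC2 this is just an environment variable picked up by the startup scripts; a minimal sketch for a 7 GB instance (the 3584m value and the startup path are illustrative assumptions, not recommendations for every machine):

```shell
# Give ES roughly half the machine's RAM; on a 7 GB instance that is ~3.5 GB.
# The exact value here is an illustrative assumption -- tune it to your box.
export ES_HEAP_SIZE=3584m
echo "ES heap: $ES_HEAP_SIZE"
# ...then start the node from the install directory, e.g.:
#   bin/elasticsearch
```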
One more thing: if you start another machine to form a cluster, you will
then have 2 machines in the cluster. If you created the index / indices with
the default number of replicas, then it is set to 1, which means you will have
2 copies of each shard. Once you start the second node, the replicas will
be allocated on it, so you will end up with the same capacity problems.
If you don't care about replicas (at the cost of high availability), you can
dynamically change the number of replicas to 0; otherwise, you will need to
provision machines appropriately.
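The replica count can be changed live through the update-settings API; a sketch, assuming an index named myindex (hypothetical) and a node listening on localhost:9200:

```shell
# Drop replicas to 0 on a live index -- the index name "myindex" is a
# placeholder. Requires a running ES node; this trades HA for capacity.
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'
```

Setting it back to 1 later will re-create the replica copies once a second node is available to hold them.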
One last thing: I recommend using the local (the default) gateway on AWS, not
s3, because of the overhead it comes with and the time it can take to do a
checkpoint. With the local gateway, each node's local drive (or EBS volume) is
used for recovery.
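Switching back is a one-line change in elasticsearch.yml (a config fragment; since local is the default, simply removing the s3 gateway settings has the same effect):

```yaml
# elasticsearch.yml (fragment): recover from each node's local disk / EBS
# instead of checkpointing to S3
gateway.type: local
```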
On Thu, May 10, 2012 at 11:59 PM, andym imwellnow@gmail.com wrote:
Yes, it's a single node, and ES_HEAP_SIZE is set to 5120M (not 520M).
You are right, I'll have to move to a bigger machine or split it into
2 machines, as this started happening relatively recently (at ~300M docs).
My question is whether these restarts are safe at the moment and do
not lead to data loss in ES (i.e. ES returns "OK" to the processing
threads, which then mark their jobs as completed, but ES fails to
persist the documents because of the restart). ES is currently running with:
threadpool.index.type: cached
threadpool.bulk.type: cached
I tried to make these "blocking", but then the processing threads were idle
most of the time, just waiting for ES to return.
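For completeness, the blocking variant mentioned above is also an elasticsearch.yml fragment; the setting names follow the 0.19 thread-pool module, and the sizes are illustrative guesses, not recommendations:

```yaml
# elasticsearch.yml (fragment): bounded pools that block callers instead of
# queueing without limit -- this back-pressure is what left the client
# threads idle
threadpool.index.type: blocking
threadpool.index.size: 32    # illustrative
threadpool.bulk.type: blocking
threadpool.bulk.size: 16     # illustrative
```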
On May 10, 4:48 pm, Shay Banon kim...@gmail.com wrote:
We are talking about a single ES node, right? For the amount of data that
you indexed, it seems like you are hitting memory limits: 520mb is not
enough for the amount of data you have; you should probably go to 3.5 or
4gb (out of the 7gb this instance type has) as ES_HEAP_SIZE.
On Thu, May 10, 2012 at 11:38 PM, andym imwell...@gmail.com wrote:
Hi,
I am currently running indexing on a c1.xlarge (with 4 ephemeral drives
in RAID0 and the gateway going to S3). Everything works great, except every
few hours ES gets overwhelmed and indexing slows significantly.
From what I can see in bigdesk, during “normal” ES operation the “Heap
Mem” window shows a sawtooth pattern, but when ES gets overwhelmed it
seems no GC happens (no sawtooth pattern in bigdesk) and memory is maxed
out (configured at 5120M).
Doing an ES service restart (through “bin/service/elasticsearch
restart”) solves the problem for a few hours, but then the problem
reappears.
I wonder whether restarting ES when it is in such a state will lead
to any data loss, so that I can put this into a cron job to ensure
indexing continues (or whether there are better ways to address the
problem).
Thanks,
-- Andy
P.S. Some background: I am running ES 0.19.2 with “refresh interval” set
to zero, and ES currently has about 400 million documents in 2 indexes
with about 600G total index size (I expect about 600M more docs, with
around 1T of data). The mapping has _source set to compressed.
The data processing and insertion into ES is done by multiple threads
on 20 or so m1.xlarge machines (when ES goes down or returns errors, the
threads back off with an exponential timeout and restart when ES is back
online). There are 8-12 threads per machine, doing mostly data
processing, and if I trust that “HTTP channels” in bigdesk indicates the
number of active connections, then 30-40 threads are connected to ES at
any given time. The indexing rate is about 750 docs per second, sometimes
maxing out at 10,000 docs per second. The average doc size is about
5000 bytes.