Document update/fetch rate suddenly floored

Hi,
We have an Elasticsearch setup in which we index documents and, as user
data pours in, update them. Every document has an expiry of 30
days, and with every update the _ttl is also refreshed. At any given point we
have around 150 million documents.
This setup has been running successfully in production for the last couple of
months, but for the last few days document fetches and updates have been
extremely slow. Updating 100 documents used to take 1-2 seconds; now it takes
50-70 seconds on average. The rate at
which documents are indexed and update requests are generated is more or
less the same, so this can't be explained by increased traffic.
Any suggestions on what might be going wrong?
ES setup: 10 shards, 1 replica, 10 nodes, 24 GB RAM, 16 GB heap, spinning
disks.

Cheers
Nitish

--

Can you share (as a gist) the output of the nodes stats API
(/_nodes/stats?all) and the indices stats API (/_stats?all)? That gives
better insight into what might be causing the slowdown. One possible
cause is merging.
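
For anyone following along, here is a minimal Python sketch of collecting those two outputs so they can be pasted into a gist. The base URL (`http://localhost:9200`) and the helper names are assumptions made for illustration, not something stated in this thread; only the two paths come from the request above.

```python
import json
import urllib.request

# Assumption: a node is reachable at this address; adjust for your cluster.
BASE_URL = "http://localhost:9200"

# The two endpoints requested above.
STATS_PATHS = ["/_nodes/stats?all", "/_stats?all"]


def stats_urls(base_url=BASE_URL):
    """Build the full URLs for the two stats endpoints."""
    return [base_url.rstrip("/") + path for path in STATS_PATHS]


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def dump_stats(base_url=BASE_URL):
    """Print each stats response as indented JSON, ready to paste into a gist."""
    for url in stats_urls(base_url):
        print("# " + url)
        print(json.dumps(fetch_json(url), indent=2, sort_keys=True))
```

Calling `dump_stats()` against a live node prints both responses one after the other; nothing is fetched until it is called.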

Martijn

--
Kind regards,

Martijn van Groningen

--

Hey Martijn,
Here are the gists:

Anything suspicious here?

--

This is hot_threads gist (just in case): https://gist.github.com/4059726

--

Which version is it?

--

The split of memory between the heap and the rest of the system isn't
optimal. A good rule of thumb is to give 50% of the available memory to
ES's heap and leave the rest to the OS. In your case I'd set the
ES_HEAP_SIZE option to 12GB on all nodes. The OS itself also needs
enough memory for the filesystem cache.

I also saw in your nodes stats output that memory is being swapped. This
can slow ES down considerably, so swapping needs to be prevented at all
times. I also recommend setting bootstrap.mlockall to true in the
elasticsearch.yml file on all nodes. Make sure the OS allows the process
to lock enough memory.
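
As a rough illustration of the two checks above, here is a small Python sketch: one helper applies the 50% rule of thumb to a node's RAM, and the other reads swap usage from the text of Linux's /proc/meminfo. The function names are made up for this example, and the approach assumes a Linux host such as the Ubuntu machines in this thread.

```python
def recommended_heap_gb(total_ram_gb, fraction=0.5):
    """Heap size in whole GB following the ~50% rule of thumb above."""
    return int(total_ram_gb * fraction)


def swap_used_kb(meminfo_text):
    """Return swap currently in use (kB), given the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    # SwapTotal minus SwapFree is the amount of swap in use.
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)


# Example with the 24 GB nodes from this thread and a made-up meminfo excerpt.
heap_gb = recommended_heap_gb(24)
sample = "MemTotal: 24675000 kB\nSwapTotal: 8388604 kB\nSwapFree: 8000000 kB\n"
in_use = swap_used_kb(sample)
```

On a real node, the text would come from `open("/proc/meminfo").read()`; any non-zero result means the machine has pages in swap.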

Also what version of Java are you running?

Martijn

--

@Igor: This is Elasticsearch version 0.19.10.

@Martijn: I can lower the heap space allocated to Elasticsearch, though
that's an unrelated optimization, I suppose. Good catch on swapping: it
is on, and I'll turn it off. Regarding the mlockall config: I have it set to
true in the ES config on all nodes, with the memlock limit set to unlimited in
'limits.conf' for the Elasticsearch user, but I still always get "Unknown mlockall
error 0" on restarting ES. I have tried all the suggested workarounds but none
worked for me, so I left it as is.

Java Version: 1.7.0_09
ES Version: 0.19.10
OS Version: Ubuntu 12.04

--

You are running a recent Java version, so that is good. We also
noticed in the hot threads API output that there is a thread taking a very
long time (see line 120). This might explain the sudden performance
drop. This stuck or slow thread is most likely a bug. The best thing
for now is to restart node 3.

Martijn

--

An issue has been opened for this bug:

Martijn

--

Restarting node3 did give a considerable performance improvement (~5x),
though it is still not back to the original level.
And the issue is closed already; that was fast! But, to be honest, I don't
really understand the problem. Could you give some insight into it? Was it
directly related to the problem I was facing?

--

As the issue describes, there was a thread stuck in an infinite loop.
Some exception occurred on node3 (not necessarily a serious error,
perhaps just a warning), ES tried to derive the REST status code for
that error, and that caused an infinite loop.
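
To illustrate the general failure mode only: the following is a hypothetical Python sketch, not the Elasticsearch code and not the actual fix. It assumes, purely for the sake of example, that the lookup walks a chain of wrapped exceptions; if two exceptions end up naming each other as their cause, an unguarded walk never terminates, while a guarded one stops on the first repeat.

```python
def root_cause_unguarded(exc):
    """Follow __cause__ links to the end. Never returns if the chain loops."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


def root_cause_guarded(exc):
    """Same walk, but stop as soon as an exception is reached a second time."""
    seen = set()
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))
        exc = exc.__cause__
    return exc


# Build two exceptions that name each other as their cause (a cycle).
outer = RuntimeError("wrapper")
inner = ValueError("wrapped")
outer.__cause__ = inner
inner.__cause__ = outer

# root_cause_unguarded(outer) would spin forever on this pair.
result = root_cause_guarded(outer)
```

A thread stuck in a loop like the unguarded one keeps a CPU core busy indefinitely, which is the kind of thing a hot threads dump makes visible.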

Martijn

--

Just a note of caution: I wouldn't recommend setting the lockable memory
limit to unlimited. Unlimited really means unlimited, not just the current
process size. With unlimited locked RAM, ES might push every other
process or memory buffer that is not mlock'ed into swap, so you don't get
the advantage you are looking for; in fact, your whole system will sooner
or later bog down as ES resource usage grows. Your smart Linux kernel
seems to be preventing you from doing nasty things via setrlimit(). ES is
not the only process running: think of the OS kernel helper processes, the
buffers, the I/O memory, etc. that your machine depends on. My recommendation
is to set the lockable memory limit to at most 50%-66% of your RAM, and to
leave at least 2 GB free as non-lockable RAM.
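
The arithmetic behind that recommendation can be sketched in Python as follows. The helper names and the `elasticsearch` user name are assumptions for illustration; the memlock value in limits.conf is expressed in KB, so the result is converted accordingly.

```python
KB_PER_GB = 1024 * 1024


def memlock_limit_kb(total_ram_gb, fraction=0.5, min_free_gb=2):
    """Lockable-memory cap in kB: a fraction of RAM, leaving min_free_gb unlocked."""
    by_fraction = int(total_ram_gb * fraction * KB_PER_GB)
    by_headroom = int((total_ram_gb - min_free_gb) * KB_PER_GB)
    # Take the stricter of the two limits, never going below zero.
    return max(0, min(by_fraction, by_headroom))


def limits_conf_lines(user, limit_kb):
    """Format soft and hard memlock entries for /etc/security/limits.conf."""
    return [
        "{} soft memlock {}".format(user, limit_kb),
        "{} hard memlock {}".format(user, limit_kb),
    ]


# Example for the 24 GB nodes in this thread, at the 50% end of the range.
limit = memlock_limit_kb(24)
lines = limits_conf_lines("elasticsearch", limit)
```

Passing `fraction=0.66` gives the upper end of the suggested range; on small machines the 2 GB headroom becomes the binding limit instead.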

Best regards,

Jörg

--