We have an ES index with around 1 billion documents. The index is
constantly updated and expanded by 16 spiders. The spiders send bulk
inserts with curl (HTTP API), varying between 1 and 10,000 documents
depending on the data found by the spiders. Everything runs fine and the
inserts take a few seconds at most. However, once in a while a curl
request will time out after 2 minutes.
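For reference, a bulk insert of the kind described here can be sent like
this. This is only a sketch: the index name "pages", the type "page", the
host "localhost:9200" and the file requests.json are assumptions, not
details from this thread.

    # requests.json holds newline-delimited action/source pairs, one pair per document:
    #   {"index":{"_index":"pages","_type":"page"}}
    #   {"url":"http://example.com/","title":"Example"}
    # --data-binary (not -d) preserves the newlines the bulk API requires
    curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @requests.json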
I'm trying to find out why this is happening, but I cannot find anything
useful in the logs. I only see some warnings from the garbage collector,
and not at the times the timeouts happen. So I have two questions:
1. How can I find out why the timeouts are happening (log settings)? (See the sketch below.)
2. Is this a bad thing, or should I just accept that this will happen and
increase the curl timeout settings?
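On question 1, one setting worth knowing about (a sketch only, and
availability depends on your ES version) is the indexing slow log, which
writes a log entry whenever indexing a single document exceeds a
threshold; the index name "pages" is again an assumption. On question 2,
if the 2-minute limit really comes from the curl invocation itself,
curl's -m/--max-time flag raises it.

    # enable the indexing slow log on an existing index (dynamic settings)
    curl -XPUT 'http://localhost:9200/pages/_settings' -d '{
      "index.indexing.slowlog.threshold.index.warn": "10s",
      "index.indexing.slowlog.threshold.index.info": "5s"
    }'

    # raise curl's overall per-request timeout to 10 minutes on the client side
    curl -s -m 600 -XPOST 'http://localhost:9200/_bulk' --data-binary @requests.json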
Which bulk requests are timing out? Is it just large ones? How
many bytes are the requests that time out?
It really differs. Sometimes they are small ones, sometimes large ones up to
700 KB in size. But when I try to reproduce the timeouts by sending large
bulks (around 24 MB), everything seems to run fine.
Bastiaan
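To pin down which requests time out and how large they are, one option
would be to let each spider record curl's own measurements per bulk
request; a sketch, with the same assumed URL and bulk file as above:

    # append one line per bulk request: HTTP status, bytes uploaded, seconds taken
    curl -s -o /dev/null \
      -w '%{http_code} %{size_upload} %{time_total}\n' \
      -XPOST 'http://localhost:9200/_bulk' --data-binary @requests.json \
      >> bulk-timing.log
    # (use -o bulk-response.json instead of /dev/null to keep the per-item bulk results)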
Take care that your spiders don't hit the 100 MB HTTP limit, which is set by
http.max_content_length and defaults to 100mb.
Jörg
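For reference, that limit is a node-level setting in elasticsearch.yml;
raising it is only a sketch of what you could do if the limit ever became
the bottleneck, and 200mb is just an example value (the node has to be
restarted for the change to take effect):

    # elasticsearch.yml
    http.max_content_length: 200mb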
Thanks for pointing this out, I didn't know about it. But when I try to
send more than 100 MB, I get the error "Recv failure: Connection reset by
peer when sending data", not a timeout. So this doesn't seem to be what is
causing the timeouts.
Bastiaan
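One small thing that could help correlate the spiders' failures with the
server logs: curl's exit code distinguishes a timeout from a reset, so
logging it per request tells you which failure mode actually occurred. A
sketch, with the same assumed URL and bulk file as above (exit code 28
means the operation timed out, 56 is a receive failure such as a
connection reset):

    curl -s -m 600 -XPOST 'http://localhost:9200/_bulk' --data-binary @requests.json
    rc=$?
    # 0 = success, 28 = timed out, 56 = recv failure (e.g. connection reset by peer)
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') exit=$rc" >> bulk-errors.log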