Timing out while indexing


(Ian Eure) #1

I'm evaluating ElasticSearch as a replacement for Solr, as my data is
getting too large for Solr to cope with comfortably.

I'm importing ~19m documents, each of which is ~512 bytes. I'm using a
two-node ElasticSearch 0.16.1 cluster on EC2 m1.large instances.
They're running on Ubuntu Maverick servers, with Sun Java 1.6.0_21-
b06. I wrote a bulk import script which uses pyes over HTTP. I'm using
the default pyes chunk size of 400 documents.

After indexing around 3m documents, connections to ES time out:

File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 634, in index
self.flush_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 660, in flush_bulk
self.force_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 668, in force_bulk
self._send_request("POST", "/_bulk", self.bulk_data.getvalue())
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 205, in _send_request
response = self.connection.execute(request)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 167, in _client_call
return getattr(conn.client, attr)(*args, **kwargs)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 59, in execute
response =
self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri,
body=request.body, headers=request.headers)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/urllib3/connectionpool.py", line 286, in urlopen
raise TimeoutError("Request timed out after %f seconds" %
self.timeout)
pyes.urllib3.connectionpool.TimeoutError: Request timed out after
5.000000 seconds

There was nothing in the logs at all, so I turned on whatever debug
settings I could find and tried again. Got the same result, and I
didn't see much in the logs. The only thing that looked suspicious was
this:

[2011-05-26 23:46:53,695][DEBUG][index.merge.scheduler ] [Marko,
Cain] [canonical][0] merge [_5yf] done, took [27.3s]

However, this appears a many times (I count 177, but this is for more
than just the import which died) prior to this last time, and none of
those seemed to have an effect.

I'm importing on the same box as I'm running ElasticSearch, so this is
just over localhost; I doubt this is a network issue.

I don't know what's going on, and I'd like to get this working. I
asked on IRC, but nobody responded. How do I go about diagnosing and
solving this problem?


(Shay Banon) #2

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html (see at the bottom).

On Friday, May 27, 2011 at 3:09 AM, Ian Eure wrote:

I'm evaluating ElasticSearch as a replacement for Solr, as my data is
getting too large for Solr to cope with comfortably.

I'm importing ~19m documents, each of which is ~512 bytes. I'm using a
two-node ElasticSearch 0.16.1 cluster on EC2 m1.large instances.
They're running on Ubuntu Maverick servers, with Sun Java 1.6.0_21-
b06. I wrote a bulk import script which uses pyes over HTTP. I'm using
the default pyes chunk size of 400 documents.

After indexing around 3m documents, connections to ES time out:

File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 634, in index
self.flush_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 660, in flush_bulk
self.force_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 668, in force_bulk
self._send_request("POST", "/_bulk", self.bulk_data.getvalue())
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 205, in _send_request
response = self.connection.execute(request)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 167, in _client_call
return getattr(conn.client, attr)(*args, **kwargs)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 59, in execute
response =
self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri,
body=request.body, headers=request.headers)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/urllib3/connectionpool.py", line 286, in urlopen
raise TimeoutError("Request timed out after %f seconds" %
self.timeout)
pyes.urllib3.connectionpool.TimeoutError: Request timed out after
5.000000 seconds

There was nothing in the logs at all, so I turned on whatever debug
settings I could find and tried again. Got the same result, and I
didn't see much in the logs. The only thing that looked suspicious was
this:

[2011-05-26 23:46:53,695][DEBUG][index.merge.scheduler ] [Marko,
Cain] [canonical][0] merge [_5yf] done, took [27.3s]

However, this appears a many times (I count 177, but this is for more
than just the import which died) prior to this last time, and none of
those seemed to have an effect.

I'm importing on the same box as I'm running ElasticSearch, so this is
just over localhost; I doubt this is a network issue.

I don't know what's going on, and I'd like to get this working. I
asked on IRC, but nobody responded. How do I go about diagnosing and
solving this problem?


(Ian Eure) #3

On May 27, 2011, at 4:29 AM, Shay Banon wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.

Okay — But it happens hundreds of other times, for similar lengths of time, and none of those cause timeouts. I'm not convinced that this is the source of the problem, it's just the only thing I saw in the logs that looked interesting.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html (see at the bottom).

What I really want is to identify the source of the timeouts, rather than work around the symptom. There's just no way I'm putting anything into production that exhibits stalls like that without fully understanding the circumstances that trigger it and being confident that it isn't going to happen under normal load.


(Jörg Prante) #4

Hi,

just my 2 cents. I'm indexing 17.8 million documents with bulk
indexing, to 3 ES nodes, with the Java API, but pyes should be
comparable. The bulk size is 300 docs of ~10k each and 12 threads in
parallel. It is normal that Lucene steps in with large amounts of I/O
load from time to time. On the nodes (RHEL6) I see spikes up to a load
of 10, 20, 30. Unfortunately the frequence and intensity is not really
predictable. Anyway, you should not bother about the Lucene internals.
Therefore, the bulk indexing needs some dynamic throttling, in order
to adapt to current ES bulk API response times. I have added a
callback mechanism with an upper limit of 30 open bulks, so if
response times get longer, the more the indexer threads have to wait.
It's normal to observe response times up to 20 seconds at Lucene I/O
work time. I found 30 max bulk jobs a sweet spot in my case, but ymmv.
It depends on your memory resource organization in the indexer process
and ES node indexing throughput. A more general solution would be
dynamic throttling on the basis of some reliable throughput statistics
summed up from the nodes, but these are not available (yet).

Jörg

On May 27, 1:29 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API:http://www.elasticsearch.org/guide/reference/api/admin-indices-update...(see at the bottom).

On Friday, May 27, 2011 at 3:09 AM, Ian Eure wrote:

I'm evaluating ElasticSearch as a replacement for Solr, as my data is
getting too large for Solr to cope with comfortably.

I'm importing ~19m documents, each of which is ~512 bytes. I'm using a
two-node ElasticSearch 0.16.1 cluster on EC2 m1.large instances.
They're running on Ubuntu Maverick servers, with Sun Java 1.6.0_21-
b06. I wrote a bulk import script which uses pyes over HTTP. I'm using
the default pyes chunk size of 400 documents.

After indexing around 3m documents, connections to ES time out:

File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 634, in index
self.flush_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 660, in flush_bulk
self.force_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 668, in force_bulk
self._send_request("POST", "/_bulk", self.bulk_data.getvalue())
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 205, in _send_request
response = self.connection.execute(request)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 167, in _client_call
return getattr(conn.client, attr)(*args, **kwargs)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 59, in execute
response =
self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri,
body=request.body, headers=request.headers)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/urllib3/connectionpool.py", line 286, in urlopen
raise TimeoutError("Request timed out after %f seconds" %
self.timeout)
pyes.urllib3.connectionpool.TimeoutError: Request timed out after
5.000000 seconds

There was nothing in the logs at all, so I turned on whatever debug
settings I could find and tried again. Got the same result, and I
didn't see much in the logs. The only thing that looked suspicious was
this:

[2011-05-26 23:46:53,695][DEBUG][index.merge.scheduler ] [Marko,
Cain] [canonical][0] merge [_5yf] done, took [27.3s]

However, this appears a many times (I count 177, but this is for more
than just the import which died) prior to this last time, and none of
those seemed to have an effect.

I'm importing on the same box as I'm running ElasticSearch, so this is
just over localhost; I doubt this is a network issue.

I don't know what's going on, and I'd like to get this working. I
asked on IRC, but nobody responded. How do I go about diagnosing and
solving this problem?


(Martijn Laarman-2) #5

My 2 cents as well

NEST supports builtin semaphoring of connections for this reason

See
https://github.com/Mpdreamz/NEST/blob/master/src/HackerNews.Indexer/Program.cs

For an approach that chops a large data set in small async bulk inserts
using a maximum number of open connections at a time (others are queued to
wait). This really speeds up indexing but there does seem to be a magic spot
after which things start to deteriate.

@kimchy would it help if we could control elasticsearch's auto-commit ?
specifying a single commit after we're done with out bulk inserts to lower
lucene's segment merging ?

On Fri, May 27, 2011 at 10:52 PM, jprante joergprante@gmail.com wrote:

Hi,

just my 2 cents. I'm indexing 17.8 million documents with bulk
indexing, to 3 ES nodes, with the Java API, but pyes should be
comparable. The bulk size is 300 docs of ~10k each and 12 threads in
parallel. It is normal that Lucene steps in with large amounts of I/O
load from time to time. On the nodes (RHEL6) I see spikes up to a load
of 10, 20, 30. Unfortunately the frequence and intensity is not really
predictable. Anyway, you should not bother about the Lucene internals.
Therefore, the bulk indexing needs some dynamic throttling, in order
to adapt to current ES bulk API response times. I have added a
callback mechanism with an upper limit of 30 open bulks, so if
response times get longer, the more the indexer threads have to wait.
It's normal to observe response times up to 20 seconds at Lucene I/O
work time. I found 30 max bulk jobs a sweet spot in my case, but ymmv.
It depends on your memory resource organization in the indexer process
and ES node indexing throughput. A more general solution would be
dynamic throttling on the basis of some reliable throughput statistics
summed up from the nodes, but these are not available (yet).

Jörg

On May 27, 1:29 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk
indexing is creating such a load that it takes more than 5 seconds for some
operations (m1.large is not really strong, and IO is probably bad). Not sure
also what load the client indexing system that is running on the same
machine is generating.

The merge logging you see are logs indicating the internal merge process
happening in Lucene (the internal process of maintaing the index files).
This merge process can be heavy IO wise.

I would suggest increasing the timeout value, just to see if you do
manage to finish your indexing process. You can also try and optimize the
index for bulk indexing using the update settings API:
http://www.elasticsearch.org/guide/reference/api/admin-indices-update...(seeat the bottom).

On Friday, May 27, 2011 at 3:09 AM, Ian Eure wrote:

I'm evaluating ElasticSearch as a replacement for Solr, as my data is
getting too large for Solr to cope with comfortably.

I'm importing ~19m documents, each of which is ~512 bytes. I'm using a
two-node ElasticSearch 0.16.1 cluster on EC2 m1.large instances.
They're running on Ubuntu Maverick servers, with Sun Java 1.6.0_21-
b06. I wrote a bulk import script which uses pyes over HTTP. I'm using
the default pyes chunk size of 400 documents.

After indexing around 3m documents, connections to ES time out:

File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 634, in index
self.flush_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 660, in flush_bulk
self.force_bulk()
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 668, in force_bulk
self._send_request("POST", "/_bulk", self.bulk_data.getvalue())
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/es.py", line 205, in _send_request
response = self.connection.execute(request)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 167, in _client_call
return getattr(conn.client, attr)(*args, **kwargs)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/connection_http.py", line 59, in execute
response =
self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri,
body=request.body, headers=request.headers)
File "/home/ieure/bulkload/lib/python2.6/site-packages/pyes-0.15.0-
py2.6.egg/pyes/urllib3/connectionpool.py", line 286, in urlopen
raise TimeoutError("Request timed out after %f seconds" %
self.timeout)
pyes.urllib3.connectionpool.TimeoutError: Request timed out after
5.000000 seconds

There was nothing in the logs at all, so I turned on whatever debug
settings I could find and tried again. Got the same result, and I
didn't see much in the logs. The only thing that looked suspicious was
this:

[2011-05-26 23:46:53,695][DEBUG][index.merge.scheduler ] [Marko,
Cain] [canonical][0] merge [_5yf] done, took [27.3s]

However, this appears a many times (I count 177, but this is for more
than just the import which died) prior to this last time, and none of
those seemed to have an effect.

I'm importing on the same box as I'm running ElasticSearch, so this is
just over localhost; I doubt this is a network issue.

I don't know what's going on, and I'd like to get this working. I
asked on IRC, but nobody responded. How do I go about diagnosing and
solving this problem?


(Shay Banon) #6

On Friday, May 27, 2011 at 7:15 PM, Ian Eure wrote:

On May 27, 2011, at 4:29 AM, Shay Banon wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.
Okay — But it happens hundreds of other times, for similar lengths of time, and none of those cause timeouts. I'm not convinced that this is the source of the problem, it's just the only thing I saw in the logs that looked interesting.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html (see at the bottom).
What I really want is to identify the source of the timeouts, rather than work around the symptom. There's just no way I'm putting anything into production that exhibits stalls like that without fully understanding the circumstances that trigger it and being confident that it isn't going to happen under normal load.
Thats why I said try, I don't know why you get timeouts, or even what this timeout indicates in pyes (maybe aparo can help), i.e. is it a timeout on sending a request, or on connection. I have no idea how your indexing code looks like, and what it does. Is it opening a new connection each time or not (really, localhost is not relevant, it is still a socket)? How many threads / processes do the indexing? Into how many indices? How many shards / replicas do you set for the index?


(Shay Banon) #7

One more thing, how big are the docs? A couple of months ago I helped chase down a similar case (the user was using pyes as well) where suddenly it would timeout, and an average size of docs ~150k. And we found out that every once in a while, there was a doc of ~10mb. And once several of those got into the same bulk, obviously indexing will be slower.

On Saturday, May 28, 2011 at 11:46 AM, Shay Banon wrote:

On Friday, May 27, 2011 at 7:15 PM, Ian Eure wrote:

On May 27, 2011, at 4:29 AM, Shay Banon wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.
Okay — But it happens hundreds of other times, for similar lengths of time, and none of those cause timeouts. I'm not convinced that this is the source of the problem, it's just the only thing I saw in the logs that looked interesting.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html (see at the bottom).
What I really want is to identify the source of the timeouts, rather than work around the symptom. There's just no way I'm putting anything into production that exhibits stalls like that without fully understanding the circumstances that trigger it and being confident that it isn't going to happen under normal load.
Thats why I said try, I don't know why you get timeouts, or even what this timeout indicates in pyes (maybe aparo can help), i.e. is it a timeout on sending a request, or on connection. I have no idea how your indexing code looks like, and what it does. Is it opening a new connection each time or not (really, localhost is not relevant, it is still a socket)? How many threads / processes do the indexing? Into how many indices? How many shards / replicas do you set for the index?


(Ian Eure) #8

On May 28, 2011, at 1:46 AM, Shay Banon wrote:

On Friday, May 27, 2011 at 7:15 PM, Ian Eure wrote:

On May 27, 2011, at 4:29 AM, Shay Banon wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.
Okay — But it happens hundreds of other times, for similar lengths of time, and none of those cause timeouts. I'm not convinced that this is the source of the problem, it's just the only thing I saw in the logs that looked interesting.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html (see at the bottom).
What I really want is to identify the source of the timeouts, rather than work around the symptom. There's just no way I'm putting anything into production that exhibits stalls like that without fully understanding the circumstances that trigger it and being confident that it isn't going to happen under normal load.

Thats why I said try, I don't know why you get timeouts, or even what this timeout indicates in pyes (maybe aparo can help), i.e. is it a timeout on sending a request, or on connection. I have no idea how your indexing code looks like, and what it does. Is it opening a new connection each time or not (really, localhost is not relevant, it is still a socket)? How many threads / processes do the indexing? Into how many indices? How many shards / replicas do you set for the index?

I'd have to dig further, but what I have noticed is when the timeouts happen, the ES process is completely unresponsive. Connection attempts to the health endpoint via curl time out when it's in this state.

This happens using one single-threaded process. It's using HTTP, so unless pyes is doing pipelining internally, it should be one request per batch. The indexer round-robins between the two nodes in the cluster.

One index, one replica, five shards.

On May 28, 2011, at 2:01 AM, Shay Banon wrote:

One more thing, how big are the docs? A couple of months ago I helped chase down a similar case (the user was using pyes as well) where suddenly it would timeout, and an average size of docs ~150k. And we found out that every once in a while, there was a doc of ~10mb. And once several of those got into the same bulk, obviously indexing will be slower.

They're around 512 bytes each. I don't have hard numbers, but I doubt the size varies significantly. I would be astonished if there was even a single document which was >10k.


(Shay Banon) #9

Which OS are you using in ec2?

On Tuesday, May 31, 2011 at 11:06 PM, Ian Eure wrote:

On May 28, 2011, at 1:46 AM, Shay Banon wrote:

On Friday, May 27, 2011 at 7:15 PM, Ian Eure wrote:

On May 27, 2011, at 4:29 AM, Shay Banon wrote:

Heya,

I see that the timeout is set to 5 seconds. It might be that the bulk indexing is creating such a load that it takes more than 5 seconds for some operations (m1.large is not really strong, and IO is probably bad). Not sure also what load the client indexing system that is running on the same machine is generating.

The merge logging you see are logs indicating the internal merge process happening in Lucene (the internal process of maintaing the index files). This merge process can be heavy IO wise.
Okay — But it happens hundreds of other times, for similar lengths of time, and none of those cause timeouts. I'm not convinced that this is the source of the problem, it's just the only thing I saw in the logs that looked interesting.

I would suggest increasing the timeout value, just to see if you do manage to finish your indexing process. You can also try and optimize the index for bulk indexing using the update settings API: http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html (see at the bottom).
What I really want is to identify the source of the timeouts, rather than work around the symptom. There's just no way I'm putting anything into production that exhibits stalls like that without fully understanding the circumstances that trigger it and being confident that it isn't going to happen under normal load.

Thats why I said try, I don't know why you get timeouts, or even what this timeout indicates in pyes (maybe aparo can help), i.e. is it a timeout on sending a request, or on connection. I have no idea how your indexing code looks like, and what it does. Is it opening a new connection each time or not (really, localhost is not relevant, it is still a socket)? How many threads / processes do the indexing? Into how many indices? How many shards / replicas do you set for the index?
I'd have to dig further, but what I have noticed is when the timeouts happen, the ES process is completely unresponsive. Connection attempts to the health endpoint via curl time out when it's in this state.

This happens using one single-threaded process. It's using HTTP, so unless pyes is doing pipelining internally, it should be one request per batch. The indexer round-robins between the two nodes in the cluster.

One index, one replica, five shards.

On May 28, 2011, at 2:01 AM, Shay Banon wrote:

One more thing, how big are the docs? A couple of months ago I helped chase down a similar case (the user was using pyes as well) where suddenly it would timeout, and an average size of docs ~150k. And we found out that every once in a while, there was a doc of ~10mb. And once several of those got into the same bulk, obviously indexing will be slower.

They're around 512 bytes each. I don't have hard numbers, but I doubt the size varies significantly. I would be astonished if there was even a single document which was >10k.


(Ian Eure) #10

On May 31, 2011, at 1:40 PM, Shay Banon wrote:

Which OS are you using in ec2?

64-bit Ubuntu 10.04 Lucid server.


(Ian Eure) #11

On May 31, 2011, at 1:40 PM, Shay Banon wrote:

Which OS are you using in ec2?

I take that back, they're both on 10.10 Maverick.


(Shay Banon) #12

Damn, was hoping it was 10.04, which is known to have many problems on ec2 which affect thinks like elasticsearch (and cassandra, and redis). I actually just read a tweet by the redis fellow recommending upgrading to Natty because of 10.10 problems on ec2... .

So, its hard to say. It sounds very similar to what we saw (and took a long time to track) on ec2 with ubuntu 10.04, which was process suddenly hanging for a period of time and not responding.

Another option is that too many sockets are being opened, and the OS is throttling the creation of new sockets. You can check that with netstat (I really need to add a stat for that in ES, mental note!). I saw it easily happening with a simple ruby sample code that did not use http keep alive and heavy indexing.

-shay.banon

On Wednesday, June 1, 2011 at 12:18 AM, Ian Eure wrote:

On May 31, 2011, at 1:40 PM, Shay Banon wrote:

Which OS are you using in ec2?
I take that back, they're both on 10.10 Maverick.


(Ian Eure) #13

On May 31, 2011, at 4:05 PM, Shay Banon wrote:

Damn, was hoping it was 10.04, which is known to have many problems on ec2 which affect thinks like elasticsearch (and cassandra, and redis). I actually just read a tweet by the redis fellow recommending upgrading to Natty because of 10.10 problems on ec2... .

Yes, we've encountered this issue before, and migrated away from Lucid.

So, its hard to say. It sounds very similar to what we saw (and took a long time to track) on ec2 with ubuntu 10.04, which was process suddenly hanging for a period of time and not responding.

Another option is that too many sockets are being opened, and the OS is throttling the creation of new sockets. You can check that with netstat (I really need to add a stat for that in ES, mental note!). I saw it easily happening with a simple ruby sample code that did not use http keep alive and heavy indexing.

The symptoms don't seem to match up with this. I was able to get ElasticSearch to hang for quite a while, long enough to fire up htop and see that it was sitting there consuming 100% CPU at the same time as connections were timing out. So the problem doesn't seem to be an issue of being IO/socket bound, but being CPU bound to the point where the process cannot respond.


(Shay Banon) #14

Ok, then there are several ways we can try and solve this:

  1. Track down whats taking this CPU. This will require some low level profiling to see where time is spent.
  2. Try and upgrade to a stronger AWS box. All ec2 users I know are running on m1.xlarge (and above).
  3. (relates to 1) If you can fire a similar test on a local machine, even on your laptop, and see if you get the same behavior. If it happens there, it will be simpler to try and track it down.

-shay.banon

On Wednesday, June 1, 2011 at 2:33 AM, Ian Eure wrote:

On May 31, 2011, at 4:05 PM, Shay Banon wrote:

Damn, was hoping it was 10.04, which is known to have many problems on ec2 which affect thinks like elasticsearch (and cassandra, and redis). I actually just read a tweet by the redis fellow recommending upgrading to Natty because of 10.10 problems on ec2... .
Yes, we've encountered this issue before, and migrated away from Lucid.

So, its hard to say. It sounds very similar to what we saw (and took a long time to track) on ec2 with ubuntu 10.04, which was process suddenly hanging for a period of time and not responding.

Another option is that too many sockets are being opened, and the OS is throttling the creation of new sockets. You can check that with netstat (I really need to add a stat for that in ES, mental note!). I saw it easily happening with a simple ruby sample code that did not use http keep alive and heavy indexing.
The symptoms don't seem to match up with this. I was able to get ElasticSearch to hang for quite a while, long enough to fire up htop and see that it was sitting there consuming 100% CPU at the same time as connections were timing out. So the problem doesn't seem to be an issue of being IO/socket bound, but being CPU bound to the point where the process cannot respond.


(Clinton Gormley) #15

Hi Ian

The symptoms don't seem to match up with this. I was able to get
ElasticSearch to hang for quite a while, long enough to fire up htop
and see that it was sitting there consuming 100% CPU at the same time
as connections were timing out. So the problem doesn't seem to be an
issue of being IO/socket bound, but being CPU bound to the point where
the process cannot respond.

Was any swap being used? I've seen this same issue while indexing,
where memory usage grows, then it tries to GC, but that takes a very
long time because it needs to swap memory in and out from disk in order
to figure out what it can release.

See bootstrap.mlockall on
http://www.elasticsearch.org/guide/reference/setup/installation.html

clint


(Colin Surprenant) #16

Hi,

I am running multiple nodes on 10.10 using m1.large EC2 instances
without any problem.

In bin/service/elasticserach.conf I use the following Java mem config:
set.default.ES_MIN_MEM=1024
set.default.ES_MAX_MEM=3072

which I will increase to 2048, 4096

As been said, 10.04 was very problematic. I am sure you know this but,
if you upgraded from 10.04 to 10.10 you also have to attach the 10.10
kernel to your ami.

Otherwise, did you say you were using the Bulk API? Have you tried
testing with the Index API instead and compare behaviours?

Colin

On Wed, Jun 1, 2011 at 5:26 AM, Clinton Gormley clinton@iannounce.co.uk wrote:

Hi Ian

The symptoms don't seem to match up with this. I was able to get
ElasticSearch to hang for quite a while, long enough to fire up htop
and see that it was sitting there consuming 100% CPU at the same time
as connections were timing out. So the problem doesn't seem to be an
issue of being IO/socket bound, but being CPU bound to the point where
the process cannot respond.

Was any swap being used? I've seen this same issue while indexing,
where memory usage grows, then it tries to GC, but that takes a very
long time because it needs to swap memory in and out from disk in order
to figure out what it can release.

See bootstrap.mlockall on
http://www.elasticsearch.org/guide/reference/setup/installation.html

clint


(Ian Eure) #17

On Jun 1, 2011, at 2:26 AM, Clinton Gormley wrote:

Hi Ian

The symptoms don't seem to match up with this. I was able to get
ElasticSearch to hang for quite a while, long enough to fire up htop
and see that it was sitting there consuming 100% CPU at the same time
as connections were timing out. So the problem doesn't seem to be an
issue of being IO/socket bound, but being CPU bound to the point where
the process cannot respond.

Was any swap being used? I've seen this same issue while indexing,
where memory usage grows, then it tries to GC, but that takes a very
long time because it needs to swap memory in and out from disk in order
to figure out what it can release.

No, I'm on EC2, so there is no swap. I'm using the default settings with bin/elasticsearch, so 1gb max heap. With logs set to DEBUG, I see GC activity, and it is not significant — tens of milliseconds per collection.


(Ian Eure) #18

On Jun 1, 2011, at 7:29 AM, Colin Surprenant wrote:

Hi,

I am running multiple nodes on 10.10 using m1.large EC2 instances
without any problem.

In bin/service/elasticserach.conf I use the following Java mem config:
set.default.ES_MIN_MEM=1024
set.default.ES_MAX_MEM=3072

which I will increase to 2048, 4096

I'll increase the heap size and retest.

As been said, 10.04 was very problematic. I am sure you know this but,
if you upgraded from 10.04 to 10.10 you also have to attach the 10.10
kernel to your ami.

I didn't upgrade, I just bought up new nodes.

Otherwise, did you say you were using the Bulk API? Have you tried
testing with the Index API instead and compare behaviours?

I am using the bulk API. I could see if I can reproduce with the regular index endpoint using multiple loader processes. Testing it with a single process was unacceptably slow, I could only index 24 docs/sec. Bulk gets me to ~1200/sec. I guess if I use 50 processes, though…


(Colin Surprenant) #19

I am using the bulk API. I could see if I can reproduce with the regular index endpoint using multiple loader processes. Testing it with a single process was unacceptably slow, I could only index 24 docs/sec. Bulk gets me to ~1200/sec. I guess if I use 50 processes, though…

using multiple writer threads should easily get you up to a few hundreds/sec.

No, I'm on EC2, so there is no swap

totally off topic but if you do want swap, you can always add a swapfile.

Colin


(Ian Eure) #20

On Jun 1, 2011, at 7:29 AM, Colin Surprenant wrote:

Hi,

I am running multiple nodes on 10.10 using m1.large EC2 instances
without any problem.

In bin/service/elasticserach.conf I use the following Java mem config:
set.default.ES_MIN_MEM=1024
set.default.ES_MAX_MEM=3072

which I will increase to 2048, 4096

As been said, 10.04 was very problematic. I am sure you know this but,
if you upgraded from 10.04 to 10.10 you also have to attach the 10.10
kernel to your ami.

Otherwise, did you say you were using the Bulk API? Have you tried
testing with the Index API instead and compare behaviours?

Updating to say that I'm seeing the same bad behavior with m1.xlarge instances. I gave ElasticSearch 10gb of RAM and started indexing with the bulk API as I was before. It continues to spike to 100% CPU and my connections time out. However, it's able to respond to the _cluster endpoints fine.

Things go fine for ~2m documents, then it starts timing out. It gets worse until I hit ~4m, at which point around 50% of my connections time out and have to be retried.

Any more suggestions? I'm all ears. It floors me that I can't even get my data into it. Can I really be the only person who's come up against this?