Timing out while indexing

Updating to say that I'm seeing the same bad behavior with m1.xlarge instances. I gave Elasticsearch 10gb of RAM and started indexing with the bulk API as I was before. It continues to spike to 100% CPU and my connections time out. However, it's able to respond to the _cluster endpoints fine.

Things go fine for ~2m documents, then it starts timing out. It gets worse until I hit ~4m, at which point around 50% of my connections time out and have to be retried.

Any more suggestions? I'm all ears. It floors me that I can't even get my data into it. Can I really be the only person who's come up against this?

Did you check the _cluster/health when it is timing out?
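
A quick way to keep an eye on it while the import runs, assuming the default HTTP port (adjust host/port to your setup), is something like:

  watch -n 5 "curl -s 'http://localhost:9200/_cluster/health?pretty=true'"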

On my side, when I do bulk indexing, the cpu does stay high between
70% and 100% and the load goes way up too but ES does not "freeze".

Colin

On Jun 1, 2011, at 3:05 PM, Colin Surprenant wrote:

Updating to say that I'm seeing the same bad behavior with m1.xlarge instances. I gave Elasticsearch 10gb of RAM and started indexing with the bulk API as I was before. It continues to spike to 100% CPU and my connections time out. However, it's able to respond to the _cluster endpoints fine.

Things go fine for ~2m documents, then it starts timing out. It gets worse until I hit ~4m, at which point around 50% of my connections time out and have to be retried.

Any more suggestions? I'm all ears. It floors me that I can't even get my data into it. Can I really be the only person who's come up against this?

Did you check the _cluster/health when it is timing out?

Nothing unusual there. Is there something specific I should be looking for?

{
"cluster_name" : "mortimer",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 5,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

On my side, when I do bulk indexing, the cpu does stay high between
70% and 100% and the load goes way up too but ES does not "freeze".

I'm fine with load and CPU and all that. What I'm not fine with are the timeouts and the massive blow to index performance that they cause. The more documents I index, the higher the error rate, the slower the import, until I'm only indexing 150/sec. At that rate, it's going to take around two days to index my whole dataset.

On Jun 1, 2011, at 11:41 AM, Colin Surprenant wrote:

I am using the bulk API. I could see if I can reproduce with the regular index endpoint using multiple loader processes. Testing it with a single process was unacceptably slow; I could only index 24 docs/sec. Bulk gets me to ~1200/sec. I guess if I use 50 processes, though…
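
For anyone following along, the bulk calls are just newline-delimited JSON (an action line followed by a source line for each document) POSTed to _bulk. A rough sketch, with the index name, type, and fields all made up:

  # bulk.json -- one action line plus one source line per document,
  # and the file must end with a newline:
  { "index" : { "_index" : "places", "_type" : "place", "_id" : "1" } }
  { "name" : "somewhere", "city" : "SF" }
  { "index" : { "_index" : "places", "_type" : "place", "_id" : "2" } }
  { "name" : "elsewhere", "city" : "NYC" }

  # --data-binary keeps the newlines intact; plain -d would strip them
  # and break the request:
  curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.json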

Using multiple writer threads should easily get you up to a few hundred docs/sec.

Following up. Using gobs of processes and non-bulk indexing makes no difference. With 40 workers, I'm only able to index 400 docs/sec, and there are still frequent timeouts. If I increase the number of workers, Elasticsearch seems to cope with it poorly and my error rate increases, leaving performance around the same place.

Nothing unusual there. Is there something specific I should be looking for?

{
"cluster_name" : "mortimer",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 5,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

Obviously, any non-green status would indicate a cluster problem
definitely affecting performance.

We are kind of shooting in the dark here. Here's a list of stuff I'd
do to try to eliminate variables and at least narrow down the area
the problem is coming from:

  • make sure it's not OS-related. My nodes are running 64-bit 10.10 using
    kernel aki-427d952b - 2.6.35-25-virtual for x86_64. If it's not the
    same kernel you are running, maybe you want to give it a spin?

  • make sure it's not Java-related. I'm running on:
    java version "1.6.0_22"
    Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
    Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

  • make sure it's not related to your import script + libs. Can you
    reproduce using a single dummy document that you index ad nauseam
    using the most barebones HTTP calls? (See the sketch after this list.)

  • try to set up a single-node ES and see if it behaves the same way
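
For the dummy-document test, a loop along these lines is all I mean by barebones calls (the index name, type, and body are just placeholders):

  while true; do
    curl -s -XPUT 'http://localhost:9200/dummyindex/dummytype/1' \
         -d '{ "foo" : "bar" }' > /dev/null
  done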

...

Colin

Hi Ian

Updating to say that I'm seeing the same bad behavior with m1.xlarge
instances. I gave Elasticsearch 10gb of RAM and started indexing with
the bulk API as I was before. It continues to spike to 100% CPU and my
connections time out. However, it's able to respond to the _cluster
endpoints fine.

Things go fine for ~2m documents, then it starts timing out. It gets
worse until I hit ~4m, at which point around 50% of my connections
time out and have to be retried.

What gateway are you using? Local (default) or shared?

I'm wondering if what you're seeing is slow disk IO.

I'd suggest tracking what is happening with your memory while indexing.
A useful tool for that is:

GitHub - lukas-vlcek/elasticsearch-js: ElasticSearch Javascript client

Download to your local machine and open up the ./demo/stats/stats.html
file.

At the same time, keep an eye on your IO with iotop (http://guichaz.free.fr/iotop/)
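
Running it with something like the following (as root) limits the output to processes actually doing IO and samples every 5 seconds:

  iotop -o -d 5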

Hopefully this'll give you a better picture of what is happening
internally

clint

Hm. Bulk vs. non-bulk indexing makes a huge difference here.

I'm not familiar with EC2 instances, but it still sounds very much
system I/O related when Lucene steps in and does the indexing. Maybe I
have some more hints to help track down the issue.

iostat will give you more info about the I/O going on, and mpstat about
iowait states. You should see periods of "100% busy" if you index to a
local disk. The faster your disk hardware can handle the Lucene I/O,
the sooner this period is over. It can take some minutes if you
operate with a JVM heap that is several GB in size.
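
For example, sampled every 5 seconds (the -x/-m flags give extended, megabyte-based stats; exact options depend on your sysstat version):

  iostat -xm 5
  mpstat -P ALL 5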

Small systems can even completely stall under heavy I/O. In that case
you should reduce shards and replicas and restart indexing with the
smaller setup. Because this reduces the number of Lucene instances, it
should shave off some CPU and I/O load, and the system might respond
better, but ES will spend a lot more time on indexing and search.
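
For example, you could recreate the index with fewer shards and no replicas just for the import (the index name here is a placeholder):

  curl -XPUT 'http://localhost:9200/places' -d '{
    "settings" : {
      "index" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 0
      }
    }
  }'

Replicas can be added back afterwards through the update settings API; the number of shards is fixed once the index is created.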

If iowait stays permanently high and you persist to a local
disk, your disk is too slow. It could be the FS or the throughput of
the drives. If possible at all, ensure you have disabled FS access
time updates by mounting the FS with the "noatime" option. You can test
raw I/O rates of the drives with dd, using a large block size and count,
and compare those numbers with results from an exerciser like bonnie++.
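
Roughly, assuming the data lives under /data (adjust the mount point and sizes to your setup):

  mount -o remount,noatime /data      # or add noatime to the fstab entry
  dd if=/dev/zero of=/data/ddtest bs=1M count=4096 conv=fdatasync
  bonnie++ -d /data -u nobody
  rm /data/ddtest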

If you persist to a network-based backend, I don't have much advice.
EC2 instances may only offer 100 Mbit interfaces. That would be
another reason for the stalls, because of network congestion.

Jörg

On Jun 2, 12:07 pm, Clinton Gormley clin...@iannounce.co.uk wrote:

Hi Ian

Updating to say that I'm seeing the same bad behavior with m1.xlarge
instances. I gave Elasticsearch 10gb of RAM and started indexing with
the bulk API as I was before. It continues to spike to 100% CPU and my
connections time out. However, it's able to respond to the _cluster
endpoints fine.

Things go fine for ~2m documents, then it starts timing out. It gets
worse until I hit ~4m, at which point around 50% of my connections
time out and have to be retried.

What gateway are you using? Local (default) or shared?

I'm wondering if what you're seeing is slow disk IO.

I'd suggest tracking what is happening with your memory while indexing.
A useful tool for that is:

GitHub - lukas-vlcek/elasticsearch-js: ElasticSearch Javascript client

Download to your local machine and open up the ./demo/stats/stats.html
file.

At the same time, keep an eye on your IO with iotop (http://guichaz.free.fr/iotop/)

Hopefully this'll give you a better picture of what is happening
internally

clint

You should try the same operation on a different box, preferably on
your development machine, as suggested by Shay. Because yes, you seem
to be the only person to have such massive issues.

Karel

On Jun 1, 11:12 pm, Ian Eure i...@simplegeo.com wrote:

Any more suggestions? I'm all ears. It floors me that I can't even get my data into it. Can I really be the only person who's come up against this?

On Jun 2, 2011, at 3:07 AM, Clinton Gormley wrote:

Hi Ian

Updating to say that I'm seeing the same bad behavior with m1.xlarge
instances. I gave Elasticsearch 10gb of RAM and started indexing with
the bulk API as I was before. It continues to spike to 100% CPU and my
connections time out. However, it's able to respond to the _cluster
endpoints fine.

Things go fine for ~2m documents, then it starts timing out. It gets
worse until I hit ~4m, at which point around 50% of my connections
time out and have to be retried.

What gateway are you using? Local (default) or shared?

I don't know for sure (how do I find out?), but I'm guessing the default, because I have tweaked nearly nothing.
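
Presumably the place to look is config/elasticsearch.yml, relative to wherever ES is installed; if there is no gateway.type line in there, it should be the local default:

  grep -i gateway config/elasticsearch.yml || echo "nothing set, so local (default)"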

I'm wondering if what you're seeing is slow disk IO.

This is likely part of the problem, as EC2 is not known for its IO. I'm running a RAID-0 on top of two ephemeral disks to mitigate.
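
The usual way to double-check how the stripe is assembled is something like this (the md device name matches what shows up in the iostat output below):

  cat /proc/mdstat
  mdadm --detail /dev/md0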

I'd suggest tracking what is happening with your memory while indexing.
A useful tool for that is:

GitHub - lukas-vlcek/elasticsearch-js: ElasticSearch Javascript client

Download to your local machine and open up the ./demo/stats/stats.html
file.

At the same time, keep an eye on your IO with iotop (http://guichaz.free.fr/iotop/)

Haven't gotten elasticsearch-js up yet; it's somewhat difficult due to the firewalling in front of my machines.

Watching iostat / vmstat while operating normally, I see:

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
md0 624.40 738.40 4900.00 7384 49000
md0 1632.40 17083.20 10832.80 170832 108328

When it starts timing out, this jumps an order of magnitude:

avg-cpu: %user %nice %system %iowait %steal %idle
20.34 0.00 4.69 14.79 3.40 56.78

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
md0 10670.80 8823.20 84192.80 88232 841928

Here's what I see in vmstat when operating normally:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 305812 59604 6241392 0 0 1397 687 453 359 14 2 80 1
1 0 0 357060 59832 6177340 0 0 1178 6918 62629 65657 22 3 72 1
2 0 0 271896 60024 6255956 0 0 1178 2614 67569 71539 21 3 72 0
3 1 0 280848 60196 6232828 0 0 1075 10085 67143 72779 22 4 69 2

And here's when I start timing out:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 0 251576 63456 6251448 0 0 24144 7642 1359 849 39 7 29 15
5 1 0 953712 63580 5549256 0 0 35452 19106 2105 1492 40 7 26 18
3 1 0 613028 63644 5889864 0 0 21140 63902 1872 1326 15 2 39 41
1 3 0 76468 63784 6423144 0 0 70484 20767 2352 1050 28 5 45 16
2 1 0 75888 63824 6424964 0 0 37152 59548 3813 3459 4 3 67 25
0 3 0 71752 63808 6429480 0 0 35920 65954 5598 5551 3 3 70 24
1 2 0 74024 62136 6430184 0 0 82711 33539 2331 493 7 5 63 24
4 0 0 78324 62080 6427240 0 0 26184 14455 4884 4463 14 4 69 10
2 1 0 383120 62260 6122188 0 0 11711 27705 16526 17202 29 5 49 11

You can see that processes start blocking, presumably due to the IO. Also interesting that free memory seems to take a sharp nosedive around the same time.

iotop shows [flush-9:0] spiking to ~90% IO at times, but I don't see a pattern with the timeouts. I have seen it spike to 90% and indexing continues with no errors. I haven't caught it right as the timeouts happen, so I guess it might be pegged at 100% for the 5s of the timeout.

So, that's great, IO load is making it time out, which isn't a huge surprise. The real issue here is: WHAT makes the IO spike like that, and how can I tune ES so it stops, at least for indexing? There is nothing at all in the Elasticsearch logs to indicate what it's doing that triggers the IO (though I am not running them at DEBUG for this import).

I honestly don't know if I can trust the /_cluster endpoints, either: I had a box wedge over the weekend, any query or index operation was timing out, but the health endpoint said everything was fine. While I can observe what the box is doing, I have no visibility into what Elasticsearch does to trigger those behaviors, so I have no idea where to start twiddling knobs to make it stop.
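
One knob that comes up a lot for bulk loads, though nobody in this thread has confirmed it helps here, is the index refresh interval. Depending on the ES version it can be relaxed on a live index via the update settings API (or set at index creation) and turned back on once the import is done; roughly, with a placeholder index name:

  curl -XPUT 'http://localhost:9200/places/_settings' -d '{
    "index" : { "refresh_interval" : "-1" }
  }'
  # ... run the import ...
  curl -XPUT 'http://localhost:9200/places/_settings' -d '{
    "index" : { "refresh_interval" : "1s" }
  }'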

On Jun 2, 2011, at 4:54 AM, jprante wrote:

Hm. Bulk vs. non-bulk indexing makes a huge difference here.

I'm not familiar with EC2 instances, but it still sounds very much
system I/O related when Lucene steps in and does the indexing. Maybe I
have some more hints to help track down the issue.

iostat will give you more info about the I/O going on, and mpstat about
iowait states. You should see periods of "100% busy" if you index to a
local disk. The faster your disk hardware can handle the Lucene I/O,
the sooner this period is over. It can take some minutes if you
operate with a JVM heap that is several GB in size.

So is it better to restrict the heap size? It seemed to struggle more with the smaller heap, but I bumped the instance size at the same time as I gave it more.
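
For anyone reproducing this: the heap is normally controlled through the ES_MIN_MEM / ES_MAX_MEM variables read by the stock startup script of this era (the sizes below are just an illustration), and the usual advice is to leave a good chunk of RAM free for the OS file cache rather than handing it all to the JVM:

  export ES_MIN_MEM=6g
  export ES_MAX_MEM=6g
  bin/elasticsearch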

Small systems can even completely stall under heavy I/O. In that case
you should reduce shards and replicas and restart indexing with the
smaller setup. Because this reduces the number of Lucene instances, it
should shave off some CPU and I/O load, and the system might respond
better, but ES will spend a lot more time on indexing and search.

I'm using the default of five shards, one replica. CPU doesn't seem to be a problem.

Here's a snapshot of elasticsearch-js showing a timeout and recovery: http://i.imgur.com/lte5z.png

If iowait stays permanently high and you persist to a local
disk, your disk is too slow. It could be the FS or the throughput of
the drives. If possible at all, ensure you have disabled FS access
time updates by mounting the FS with the "noatime" option. You can test
raw I/O rates of the drives with dd, using a large block size and count,
and compare those numbers with results from an exerciser like bonnie++.

I've seen iowait spike to around 18%, but it's <1% for most of the indexing.

If you persist to a network-based backend, I don't have much advice.
EC2 instances may only offer 100 Mbit interfaces. That would be
another reason for the stalls, because of network congestion.

I'm not 100% sure what you mean by that, but I guess you're asking if the index is on a block device that requires network access. The answer to that question is no; I'm using a RAID-0 stripe of instance-store devices, which are as close to real disks as you can get on EC2.
