On Jun 2, 2011, at 3:07 AM, Clinton Gormley wrote:
Hi Ian
Updating to say that I'm seeing the same bad behavior with m1.xlarge
instances. I gave Elasticsearch 10GB of RAM and started indexing with
the bulk API as I was before. It continues to spike to 100% CPU and my
connections time out. However, it's able to respond to the _cluster
endpoints fine.
Things go fine for ~2m documents, then it starts timing out. It gets
worse until I hit ~4m, at which point around 50% of my connections
time out and have to be retried.
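(For reference, the bulk requests are the standard newline-delimited _bulk format sent with curl; the index/type names and document bodies below are just illustrative, not my real data:)

curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @batch.json

where batch.json is pairs of action and source lines, ending with a trailing newline:

{"index": {"_index": "test", "_type": "doc", "_id": "1"}}
{"field1": "value1"}
{"index": {"_index": "test", "_type": "doc", "_id": "2"}}
{"field1": "value2"}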
What gateway are you using? Local (default) or shared?
I don't know for sure (how do I find out?), but I'm guessing the default, because I have tweaked nearly nothing.
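(For what it's worth, I figure I can confirm by grepping the config or asking the running node for its settings; this is my guess at the right incantation for 0.x, and the config path depends on how it was installed:)

grep gateway config/elasticsearch.yml
curl 'http://localhost:9200/_cluster/nodes?settings=true&pretty=true'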
I'm wondering if what you're seeing is slow disk IO.
This is likely part of the problem, as EC2 is not known for its IO. I'm running a RAID-0 on top of two ephemeral disks to mitigate.
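(The array itself is nothing fancy; roughly the following, with device names and mount point from memory rather than copied off the box:)

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext3 /dev/md0
mount -o noatime /dev/md0 /mnt/elasticsearch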
I'd suggest tracking what is happening with your memory while indexing.
A useful tool for that is:
https://github.com/lukas-vlcek/elasticsearch-js
Download to your local machine and open up the ./demo/stats/stats.html
file.
At the same time, keep an eye on your IO with iotop
http://guichaz.free.fr/iotop/
I haven't gotten elasticsearch-js up yet; it's somewhat difficult due to the firewalling in front of my machines.
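In the meantime I can at least pull the heap and cache numbers with curl against the node stats endpoint (guessing at the 0.x path here):

curl 'http://localhost:9200/_cluster/nodes/stats?pretty=true'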
Watching iostat / vmstat while operating normally, I see:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
md0             624.40       738.40      4900.00       7384      49000
md0            1632.40     17083.20     10832.80     170832     108328
When it starts timing out, this jumps an order of magnitude:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.34    0.00    4.69   14.79    3.40   56.78

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
md0           10670.80      8823.20     84192.80      88232     841928
Here's what I see in vmstat when operating normally:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 305812 59604 6241392 0 0 1397 687 453 359 14 2 80 1
1 0 0 357060 59832 6177340 0 0 1178 6918 62629 65657 22 3 72 1
2 0 0 271896 60024 6255956 0 0 1178 2614 67569 71539 21 3 72 0
3 1 0 280848 60196 6232828 0 0 1075 10085 67143 72779 22 4 69 2
And here's when I start timing out:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 0 251576 63456 6251448 0 0 24144 7642 1359 849 39 7 29 15
5 1 0 953712 63580 5549256 0 0 35452 19106 2105 1492 40 7 26 18
3 1 0 613028 63644 5889864 0 0 21140 63902 1872 1326 15 2 39 41
1 3 0 76468 63784 6423144 0 0 70484 20767 2352 1050 28 5 45 16
2 1 0 75888 63824 6424964 0 0 37152 59548 3813 3459 4 3 67 25
0 3 0 71752 63808 6429480 0 0 35920 65954 5598 5551 3 3 70 24
1 2 0 74024 62136 6430184 0 0 82711 33539 2331 493 7 5 63 24
4 0 0 78324 62080 6427240 0 0 26184 14455 4884 4463 14 4 69 10
2 1 0 383120 62260 6122188 0 0 11711 27705 16526 17202 29 5 49 11
You can see that processes start blocking, presumably waiting on IO. It's also interesting that free memory takes a sharp nosedive around the same time.
iotop shows [flush-9:0] spiking to ~90% IO at times, but I don't see a pattern that lines up with the timeouts. I have seen it spike to 90% while indexing continues with no errors. I haven't caught it right as the timeouts happen, so I guess it might be pegged at 100% for the 5s of the timeout.
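(For reference, I'm watching it with roughly the following, so it only shows threads that are actually doing IO:)

iotop -o -d 5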
So, that's great: IO load is making it time out, which isn't a huge surprise. The real issue here is: WHAT makes the IO spike like that, and how can I tune ES so it stops, at least for indexing? There is nothing at all in the Elasticsearch logs to indicate what it's doing that triggers the IO (though I am not running them at DEBUG for this import).

I honestly don't know if I can trust the /_cluster endpoints either: I had a box wedge over the weekend where any query or index operation was timing out, but the health endpoint said everything was fine. While I can observe what the box is doing, I have no visibility into what Elasticsearch does to trigger those behaviors, so I have no idea where to start twiddling knobs to make it stop.
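(To be clear, by "knobs" I mean settings along the lines of the index refresh interval, e.g. something like the following; the index name is made up, and I don't know yet whether this setting, or any other, actually affects the IO spikes:)

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{"index": {"refresh_interval": "30s"}}'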