I have a bulk data loading job that creates upsert statements and batches
500 of them into a single bulk operation using the _bulk interface.
I send each bulk request via HTTP (on port 9200) and wait for the response
before sending the next one, which goes out immediately after the response
arrives.
I do not hit any thread pool limits.
I have replicas set to zero and refresh interval set to -1 to make the
loading as lightweight as possible.
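For reference, the settings get applied roughly like this (the index name
"myindex" is a placeholder; newer ES versions also want the Content-Type
header):

  # disable replicas and periodic refreshes before the load starts
  curl -XPUT 'localhost:9200/myindex/_settings' -H 'Content-Type: application/json' -d '{
    "index": { "number_of_replicas": 0, "refresh_interval": "-1" }
  }'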
Timing these, they start out pretty fast and run at about 2000 documents per
second, i.e. four or so HTTP round trips per second.
This lasts for a few minutes and then it starts to slow. Within an hour,
it's running at about 1200 per second. In another hour, it's down to about
600 per second. Then it seems to flatten out at about 400 per second until
the job is done, some 8 million documents later.
So my question is - why the slowdown? It's very consistent, seems
reasonably linear, and happens 100% of the time.
Loading 9 million documents starts off at 2000+ per second and, by hour
three, is down to 300 per second. The whole job takes the better part of 8
hours, with this linear slowdown.
FYI it is still the weekend for parts of the world, and we all enjoy our
time off.
How many nodes do you have? What is your heap size? Are you monitoring your
system and ES? If so, what does it tell you? Have you tried increasing the
bulk count?
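If nothing is in place yet, even the _cat APIs give a quick picture while
the job is running (the index name below is a placeholder, and the exact
columns vary a bit between ES versions):

  # heap, load, and thread pool pressure per node
  curl 'localhost:9200/_cat/nodes?v'
  curl 'localhost:9200/_cat/thread_pool?v'
  # indexing and merge stats for the target index
  curl 'localhost:9200/myindex/_stats/indexing,merges?pretty'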
There are a lot of possible reasons, just to name a few:
client program errors
network issues
server issues (too few nodes, query load, tight resources)
improper settings, e.g. for fast segment merge
myriads of new fields
client does not evaluate the batch response (see the sketch at the end of this post)
etc. etc.
Even 2000+ per second is ridiculously slow for a multithreaded client and
multiple nodes. From your observation that it slows down after a few
minutes, I assume it has to do with client program errors or improper
settings for fast segment merging.
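On the batch-response point: _bulk can answer HTTP 200 even when individual
items failed, so the client has to look inside the body. A minimal check,
assuming the batch has been written to a placeholder file batch.json:

  # --data-binary keeps the newlines the _bulk format requires
  curl -s -XPOST 'localhost:9200/_bulk' -H 'Content-Type: application/json' \
       --data-binary @batch.json | grep -o '"errors":[a-z]*'
  # if this prints "errors":true, inspect the per-item "status" and "error" fields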
This is probably not related to the slowdown, but when using scripts for
updating docs, it's best to keep the script constant and use params for the
changing values (all the $vars in your PHP script). That way ES compiles
the script once and reuses it, rather than paying the compilation cost on
every update.
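Roughly the difference inside the _bulk body, using the counter example
from the update docs (field names are illustrative, and this is the older
inline-script syntax; recent versions expect "script" to be an object with
"source"/"inline" and "params"):

  { "update" : { "_index" : "myindex", "_type" : "doc", "_id" : "123" } }
  { "script" : "ctx._source.counter += count", "params" : { "count" : 4 }, "upsert" : { "counter" : 1 } }

The script string stays identical for every document; only params changes,
so ES can cache the compiled script.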
Do the node logs say anything about index throttling?
Maybe catch and post some hot threads once you're down to 400 per second?
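Hot threads can be pulled straight over HTTP while throughput is at its
lowest, e.g. (the log path is a placeholder):

  # a few samples of the busiest threads on each node
  curl 'localhost:9200/_nodes/hot_threads?threads=5'
  # the node logs, if merges have fallen behind, typically mention throttling indexing
  grep -i 'throttling indexing' /path/to/elasticsearch/logs/*.log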
Refactoring my statement from a script to a straight update { doc,
doc_as_upsert } seems to have done the trick. So rather than diagnose
what's odd about the script, this has resolved my issue. Yeah, a lazy
solution, but a faster one.
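In case it helps anyone else, each document in the bulk body now looks
roughly like this (index, type, field names and values are placeholders):

  { "update" : { "_index" : "myindex", "_type" : "doc", "_id" : "123" } }
  { "doc" : { "field" : "value" }, "doc_as_upsert" : true }

With doc_as_upsert set, the doc is indexed as-is when the id doesn't exist
yet, so no script compilation or execution is involved.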