Indexing can be CPU-heavy because of segment merging, which is effectively a streaming mergesort plus some extra work. Most people end up I/O-bound, but the CPU/I/O ratio depends on the documents themselves: how many fields there are, how complex the analysis chains are, and so on. Complex documents shift the balance toward CPU.
Note, however, that your 3.7/4 load average doesn't necessarily mean you're CPU-bound. Since it's still under 4, you're technically not at max capacity: on average, 3.7 tasks want to use 4 cores at any given moment. An overloaded box would look more like 6/4, meaning six tasks contending for four cores at any particular moment.
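To sanity-check this on your own box, here's a quick sketch (Python, assuming a Unix system where `os.getloadavg()` is available):

```python
import os

# 1-, 5-, and 15-minute load averages, same numbers uptime(1) reports
load1, load5, load15 = os.getloadavg()
cores = os.cpu_count()

# A ratio under 1.0 means fewer runnable/waiting tasks than cores on average;
# a sustained ratio well above 1.0 indicates contention (CPU or disk).
ratio = load1 / cores
print(f"load {load1:.2f} across {cores} cores -> ratio {ratio:.2f}")
```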
Also, from the docs:
> /proc/loadavg
> The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as the load average numbers given by uptime(1) and other programs.
You'll note that processes waiting on disk I/O are included, so you can see high load averages with very little CPU burn: all the cores are just sitting there waiting on I/O to return.
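Since load counts both runnable (R) and uninterruptible-sleep (D) tasks, you can get a rough breakdown of which kind is driving your load by tallying task states. A Linux-only sketch reading `/proc` (the state is the first field after the process name in `/proc/<pid>/stat`):

```python
import os

def task_states():
    """Tally process states from /proc; R = runnable (CPU), D = waiting on I/O."""
    counts = {}
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/stat") as f:
                data = f.read()
        except OSError:
            continue  # process exited between listdir and open
        # The comm field is wrapped in parentheses and may contain spaces;
        # the single-letter state is the first field after the closing paren.
        state = data.rpartition(")")[2].split()[0]
        counts[state] = counts.get(state, 0) + 1
    return counts

print(task_states())  # lots of 'D' with little 'R' points at an I/O-bound box
```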
A better approach is to monitor CPU utilization alongside disk utilization, e.g. by watching IOPS and throughput. I'd hazard a guess that your disks are the bottleneck, and that you'll see them chugging along near their maximum sequential throughput or IOPS.
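If you don't have a monitoring stack handy, `iostat -x 1` (from sysstat) shows per-device utilization; the same numbers come from `/proc/diskstats`. A rough Linux-only sketch that samples the "time spent doing I/O" field (field 13 in the file, in milliseconds) to estimate how busy a disk is:

```python
import time

def io_busy_ms(device):
    """Cumulative milliseconds the device has spent doing I/O (/proc/diskstats field 13)."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[12])
    raise ValueError(f"device {device!r} not found")

def disk_utilization(device, interval=1.0):
    """Approximate %util over `interval` seconds, like iostat -x reports."""
    before = io_busy_ms(device)
    time.sleep(interval)
    after = io_busy_ms(device)
    return 100.0 * (after - before) / (interval * 1000.0)

# e.g. disk_utilization("sda") -- sustained values near 100% mean the disk is saturated
```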
Regarding your second question: ES throttles indexing at the Lucene level. If it finds that index merging is not keeping up with the indexing rate, it automatically throttles back, which creates backpressure and grows the queue of bulk threads. So from the ES side, indexing should be automatically prevented from swamping queries.
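You can check whether merge throttling is actually kicking in via the index stats API (`GET <index>/_stats`), which reports cumulative throttled merge time. A sketch that just pulls the relevant field out of a stats response; the numbers below are invented for illustration:

```python
# Hypothetical slice of a response from GET myindex/_stats (values made up):
stats = {
    "_all": {
        "primaries": {
            "merges": {
                "current": 2,
                "total": 840,
                "total_time_in_millis": 5130000,
                "total_throttled_time_in_millis": 912000,
            }
        }
    }
}

merges = stats["_all"]["primaries"]["merges"]
throttled_s = merges["total_throttled_time_in_millis"] / 1000.0
# A growing throttled time means merging can't keep up with the indexing rate.
print(f"merges spent {throttled_s:.0f}s throttled so far")
```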
On your side, you can try feeding bulks to ES more slowly, or smoothing out bursts: instead of sending all the bulks at once, queue them up in your app and feed them to ES at a constant rate.
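A minimal sketch of the constant-rate idea: a feeder that spaces bulk requests out on a fixed schedule instead of firing them in a burst. `send_bulk` here is a stand-in for whatever client call you actually use:

```python
import time
from collections import deque

def feed_at_constant_rate(bulks, send_bulk, per_second=2.0):
    """Drain queued bulk payloads at a steady rate instead of in one burst."""
    interval = 1.0 / per_second
    queue = deque(bulks)
    next_send = time.monotonic()
    while queue:
        now = time.monotonic()
        if now < next_send:
            time.sleep(next_send - now)
        send_bulk(queue.popleft())  # stand-in for e.g. an HTTP _bulk request
        next_send += interval       # fixed schedule smooths out bursts

# usage (hypothetical client): feed_at_constant_rate(pending_bulks, client.bulk, per_second=5)
```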