Lots of guesstimating took place, since I'm very new to ES :).
Basically I checked the running tasks. On cursory inspection they just showed the tasks query itself, but I wasn't convinced that nothing else was happening when the CPU usage was so high. Since the CPU usage top shows for a process is just an average over the past polling period, my task queries were probably hitting the idle times of the ES process. When I then fired off the task query as quickly as possible, bulk indexing jobs appeared every once in a while.
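In case it helps, this is roughly how I polled the tasks API, just a sketch that assumes ES is listening on localhost:9200 without authentication:

```bash
# Hammer the tasks API so that short-lived tasks (e.g. periodic bulk indexing)
# show up between the idle periods instead of being missed.
while true; do
  curl -s 'localhost:9200/_tasks?detailed&pretty'
  sleep 0.2
done
```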
Since my ES instance is idle 99% of the time (it's an installation in my dev environment), the only other thing periodically accessing ES was X-Pack with its monitoring information. After turning off monitoring in both elasticsearch.yml and kibana.yml, the CPU usage dropped.
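If you're on a version where the monitoring collector can be toggled dynamically (6.3+, I believe), you can also approximate the yml edit via the cluster settings API. Take this as a hedged sketch; the setting names vary between versions:

```bash
# Stop X-Pack from collecting monitoring data without restarting the node.
curl -s -XPUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'
```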
From the symptoms you're describing, it might also be a number of other things:
- Data being shuffled between shards on the same node (you might try raising the throttling limit and see if it gets better; see the sketch after this list)
- Your working set doesn't fit into the Java heap, so ES needs to continuously evict documents from the heap and redo all the caching for the newly loaded entries (the sketch after this list also shows how to check heap pressure)
- Maybe some Java GC edge case is being hit, where it tries to free heap but nothing is freeable, so it goes into a loop? (I don't know enough about Java and its GCs to say anything here for certain, just a guess. I'd also be surprised if this were actually the case, since the Java GCs are quite mature and the ES devs will have tuned the GC settings for performance.)
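If you want to poke at the first two points, here's a rough sketch, again assuming localhost:9200 and no auth (the setting name and its default may differ on older versions):

```bash
# Raise the recovery/relocation throttle (default 40mb) to see whether shard
# movement is what keeps the node busy:
curl -s -XPUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.recovery.max_bytes_per_sec": "100mb"}}'

# Check heap pressure and GC activity; a heap_used_percent that stays high and
# old-gen collection counts/times that keep climbing would point at the
# working-set-doesn't-fit theory:
curl -s 'localhost:9200/_nodes/stats/jvm?pretty' \
  | grep -E 'heap_used_percent|collection_count|collection_time_in_millis'
```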
Did you check what the disk IO looks like during the periods of high CPU usage? Check with iotop, iostat or vmstat. iotop -o will show you the processes actually performing IO, and iostat -ktx 3 will show how the IO subsystem is performing. vmstat only shows the blocks read and written in its bi and bo columns, but it gives a good first impression of how busy the system is with disk IO.
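For completeness, the exact invocations I'd use (iotop usually needs root):

```bash
sudo iotop -o     # only show processes/threads that are actually doing IO
iostat -ktx 3     # extended per-device stats in kB with timestamps, every 3 seconds
vmstat 3          # watch the bi/bo columns for blocks read/written, every 3 seconds
```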