CPU usage doubles and does not return to normal

I have a 6-node cluster (Elasticsearch 2.3.3, Java 1.8.0_91) with only 3 nodes storing data. We are feeding our syslog stream into the nodes, so the load is fairly constant and consistent. After a few days of running, the CPU on the storage nodes jumps from approximately 25% to approximately 50% and stays there. The last occurrence was at Jun 8 21:50 (we keep CPU stats at 5-second intervals). The only things in the log at that time (or even near it) are:

2016-06-08T21:45:05 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.21] update_mapping [flare]
2016-06-08T21:45:06 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.21] update_mapping [flare]
2016-06-08T21:48:02 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.09.04] update_mapping [flare]
2016-06-08T21:49:50 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.21] update_mapping [flare]
2016-06-08T21:49:50 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.21] update_mapping [flare]
2016-06-08T21:50:33 tetra local3.info elasticsearch: [es-dcs-dev] [monitor.jvm] [es-tetra] [gc][old][308903][317] duration [5.3s], collections [1]/[6.3s], total [5.3s]/[19.3s], memory [3.1gb]->[2.2gb]/[3.9gb], all_pools {[young] [154.2mb]->[21.4mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [2.9gb]->[2.2gb]/[3.6gb]}
2016-06-08T21:50:38 peto local3.info elasticsearch: [es-dcs-dev] [monitor.jvm] [es-peto] [gc][old][308933][320] duration [6.2s], collections [1]/[7.1s], total [6.2s]/[20.9s], memory [2.9gb]->[2.3gb]/[3.9gb], all_pools {[young] [15.3mb]->[19.5mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [2.9gb]->[2.3gb]/[3.6gb]}
2016-06-08T21:50:43 buri local3.info elasticsearch: [es-dcs-dev] [monitor.jvm] [es-buri] [gc][old][309053][320] duration [5.3s], collections [1]/[6s], total [5.3s]/[19.2s], memory [3gb]->[2.4gb]/[3.9gb], all_pools {[young] [5mb]->[834.1kb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [3gb]->[2.4gb]/[3.6gb]}
2016-06-08T21:51:28 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.21] update_mapping [flare]
2016-06-08T21:57:32 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.09.13] creating index, cause [auto(bulk api)], templates [flare_template], shards [3]/[0], mappings [flare]
2016-06-08T21:57:32 peto local3.info elasticsearch: [es-dcs-dev] [cluster.routing.allocation] [es-peto] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[flare-2016.06.09.13][1], [flare-2016.06.09.13][2], [flare-2016.06.09.13][0]] ...]).
2016-06-08T21:57:32 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.09.13] update_mapping [flare]
2016-06-08T21:58:03 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.21] update_mapping [flare]
2016-06-08T21:59:25 peto local3.info elasticsearch: [es-dcs-dev] [cluster.metadata] [es-peto] [flare-2016.06.08.22] update_mapping [flare]

So it looks like garbage collection, but those log lines indicate that each GC completed in mere seconds, which does not explain CPU that stays doubled afterwards.
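
To check whether the heap is under sustained pressure, rather than just the occasional old-gen collection shown above, the node stats can be polled and compared between samples. A rough sketch in Python, assuming the cluster answers on localhost:9200 (the host and polling interval are placeholders):

import time
import requests

ES = "http://localhost:9200"  # assumption: point this at any node in the cluster

def jvm_snapshot():
    """Return {node_name: (heap_used_percent, old_gc_count, old_gc_millis)}."""
    stats = requests.get(ES + "/_nodes/stats/jvm").json()
    out = {}
    for node in stats["nodes"].values():
        old = node["jvm"]["gc"]["collectors"]["old"]
        out[node["name"]] = (node["jvm"]["mem"]["heap_used_percent"],
                             old["collection_count"],
                             old["collection_time_in_millis"])
    return out

before = jvm_snapshot()
time.sleep(60)  # sample interval; tune as needed
after = jvm_snapshot()

for name, (heap, count, millis) in sorted(after.items()):
    _, count0, millis0 = before.get(name, (0, 0, 0))
    print("%s heap=%d%% old-gc: +%d collections, +%dms"
          % (name, heap, count - count0, millis - millis0))

If old-gen collection time keeps climbing between samples, the heap is under constant pressure even though each individual collection is short; if it stays flat, the extra CPU is probably coming from somewhere else.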

Has anyone else seen this behavior?
How do I find out what the cluster is doing with all of that extra CPU?

We are using Kibana 4.5.1, and since we are new to this we may have written a poor query. I just want to know how to track such a query down and cancel it if that is the cause.
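
If it does turn out to be a query, one idea is to list the running tasks to see what is actually executing. A rough sketch of that (the task management API is still experimental in 2.3, and as far as I know search tasks generally cannot be cancelled mid-flight in 2.x, so this would mainly help identify the source):

import requests

ES = "http://localhost:9200"  # assumption: point this at any node in the cluster

# detailed=true includes a description of each task (e.g. the search request)
tasks = requests.get(ES + "/_tasks", params={"detailed": "true"}).json()

for node in tasks["nodes"].values():
    node_tasks = node["tasks"]
    # the docs show tasks keyed by task id; fall back gracefully if a list comes back
    items = node_tasks.values() if isinstance(node_tasks, dict) else node_tasks
    for task in items:
        print(node.get("name", "?"),
              task["action"],
              task.get("running_time_in_nanos", 0) // 1000000, "ms",
              task.get("description", ""))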

Any help is appreciated.

Capture hot threads to see what the CPUs are doing?

Mike McCandless

Thanks, that's a good place to start for me. Next time it happens I can gather a little more info. 400,000,000 documents a day; this is amazing to me.
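
In case it helps anyone else reading this, capturing hot threads is just a plain GET against any node; a minimal sketch (host, thread count, and interval are assumptions, tune to taste):

import requests

ES = "http://localhost:9200"  # assumption: point this at any node in the cluster

# hot_threads returns plain text, not JSON
resp = requests.get(ES + "/_nodes/hot_threads",
                    params={"threads": 5,
                            "interval": "500ms",
                            "ignore_idle_threads": "true"})
print(resp.text)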