The trend you see on the left indicates memory pressure. A full GC of the old generation is unable to bring heap usage down to 50% or lower. As a result, you get a high frequency of rather ineffective full GC operations: the default setting for the JVM in Elasticsearch is to trigger a full GC at 70% heap usage, so the JVM is trying very hard to keep the heap below that line. It appears that yours are triggering a bit higher (75% to 78%), which also indicates memory pressure.
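If you want to watch this from the API side rather than the GC logs, the cat nodes endpoint reports heap usage per node (a sketch; the column names are from the `_cat/nodes` docs, and `heap.percent` is the number to watch hovering near the GC threshold):

```
GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max
```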
Yes, your assumption is correct: the attached quote sums it up. You have 16G heaps on each of 12 data nodes, with 12K shards, or roughly 1K shards per node. This is almost certainly why you are experiencing memory pressure.
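You can confirm the per-node shard distribution with the allocation endpoint, which lists the shard count assigned to each node alongside disk usage:

```
GET _cat/allocation?v
```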
The answer to question 1 is that ongoing operations (caching and index buffering, chiefly) fill the heap over time, as designed. That is why heap usage climbs gradually rather than jumping back up all at once after a restart.
The answer to question 2 is "no." Caching and buffering are necessary operations that speed things along. Just because a restart dumps them quickly doesn't make them (or the data therein) unnecessary.
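If you want to see where that heap is actually going, the same cat nodes endpoint exposes the per-node cache and segment memory sizes (a sketch; column names are from the `_cat/nodes` docs, and availability varies a bit by version):

```
GET _cat/nodes?v&h=name,fielddata.memory_size,query_cache.memory_size,request_cache.memory_size,segments.memory
```

With ~1K shards per node, `segments.memory` in particular tends to be a large, fixed cost that survives any amount of cache eviction.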
The short answer to your dilemma is that you need to address the memory pressure. That is most easily achieved by either 1) adding more heap per node, or 2) adding more nodes. Your shard count per node is simply too high for the heap you have. Reducing the shard count is not easy without upgrading to Elasticsearch 5, which adds the Shrink API. That API lets you collapse, for example, a 5-shard index into a 1-shard index, a process usually reserved for cold/stale indices, i.e. those which are no longer indexing new data. It is a somewhat involved process, but it is a useful way to reduce the shard count and thereby reduce memory pressure.
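For reference, if you do go the 5.x route, a shrink looks roughly like this (index and node names here are placeholders; per the Shrink API docs you must first make the index read-only and relocate a copy of every shard onto a single node before shrinking):

```
PUT /my_index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }
}

POST /my_index/_shrink/my_index_shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}
```

Once the shrunken index is green, you would typically delete the original and (if needed) alias the new one in its place.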