Glad that Kimbro finally commented on the OS cache. The HEAP holds many other caches, but shard data is held in the OS file cache, which is not subject to GC or the 32GB theoretical HEAP limit. A node with 256GB of memory and a 32GB HEAP has around 222GB of filesystem cache available for shards.
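For illustration, here is the back-of-the-envelope arithmetic behind that 222GB figure as a small Python sketch; the OS reserve is an assumed round number, not a measured value.

    # Rough memory budget for a single ES node; all numbers are illustrative.
    total_ram_gb = 256
    heap_gb = 32         # JVM HEAP, kept at/under the compressed-oops ceiling
    os_reserve_gb = 2    # assumed slice left for the OS and other processes

    fs_cache_gb = total_ram_gb - heap_gb - os_reserve_gb
    print(f"~{fs_cache_gb} GB of OS file cache available for shard data")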
While it is ideal to keep shards in memory, it is not imperative. I recently helped a client running around 300GB per shard. Their primary reason for the engagement was not scale or performance; they were running a pre-1.x version of ES and needed to upgrade. My first recommendation was to reduce the shard size. It is not required to have the active dataset in RAM. Of course an index that is not in RAM will be significantly slower, but ES is still fast.
If the active data is not in RAM, the performance of the disk becomes critical. This is one of the reasons, aside from ingestion, that SSD-based clusters are becoming common. This client, like many of us, had time-based data, which is easy to segment, and most queries run against the most recent data, i.e. the last 24 hours or the last 7 days. Occasionally someone needs to run a 90-day or one-year query, and it is simply not possible to keep that much data in RAM on most budgets. Using SSDs keeps those occasional queries fast without incurring the expense of more RAM and additional nodes.
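To make the time segmentation concrete, here is a hedged sketch assuming daily indexes named like logs-YYYY.MM.DD and a local node (both are made-up details): a "last 7 days" query only ever touches seven indexes, no matter how much history sits on disk.

    # Search only the last 7 daily indexes; the naming scheme is an assumption.
    from datetime import date, timedelta
    import requests

    ES = "http://localhost:9200"
    days = [date.today() - timedelta(days=n) for n in range(7)]
    indexes = ",".join(d.strftime("logs-%Y.%m.%d") for d in days)

    resp = requests.get(
        f"{ES}/{indexes}/_search",
        params={"ignore_unavailable": "true"},  # skip days that do not exist
        json={"query": {"match": {"message": "timeout"}}},
    )
    print(resp.json()["hits"])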
There are many issues with this single-index premise, and most have been covered, but one that does not seem to have been mentioned directly is growth. As was mentioned, ES cannot split shards the way SOLR can. So when your index with 100 shards needs to grow beyond 100 nodes, you will need to reindex everything in one shot. Likewise, if your shards grow beyond the theoretical 32GB ceiling you have placed on them, you will need to increase the number of shards. If you are using a partitioning scheme, however, you do not have the same issues: you can reindex partitions independently if needed, and if you are using time-based data you simply create new indexes per time period, so the data naturally partitions.
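As a sketch of reindexing one partition independently, assuming a version that ships the _reindex API and made-up index names, only the one time period is rebuilt while the rest of the cluster is untouched:

    # Rebuild a single time-based partition with a larger shard count.
    import requests

    ES = "http://localhost:9200"

    # New target index for just this partition (names are illustrative).
    requests.put(f"{ES}/logs-2015.06-v2", json={"settings": {"number_of_shards": 10}})

    # Copy only that partition's documents into the new index.
    requests.post(
        f"{ES}/_reindex",
        json={"source": {"index": "logs-2015.06"}, "dest": {"index": "logs-2015.06-v2"}},
    )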
It is unreasonable to expect ES to identify the ideal partitioning of your data. ES has done its part in that regard by providing the index-over-shards concept: the shards partition the data arbitrarily and are grouped into a logical dataset. If you understand that an index is nothing more than a logical grouping of shards, and that ES provides other mechanisms to group shards such as aliases, then it starts to make sense that, from the perspective you are taking, there is little difference between a single index with 100 shards and 10 indexes with 10 shards each. Yes, you would have to do a little work up front to distribute incoming documents, but that does not require an army of librarians as you have asserted. With some effort a natural key can typically be identified, which as a bonus will improve query times. If not, you can do something similar to what ES does internally and hash. If you use a consistent hash, you can minimize the impact of the eventual resizing, as fewer documents will need to move between shards. Explaining this fully is well beyond the scope of this reply, but the point is that partitioning need not be a manual process.
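To illustrate the consistent-hash point, here is a minimal sketch in Python; the ring construction, the virtual-node count, and the partition names are all illustrative, not anything ES provides out of the box.

    # A minimal consistent-hash ring mapping document keys to index partitions.
    import bisect
    import hashlib

    VNODES = 64  # virtual nodes per partition smooth out the distribution

    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def build_ring(partitions):
        """Place VNODES points per partition on a sorted hash ring."""
        points = sorted((_hash(f"{p}#{v}"), p) for p in partitions for v in range(VNODES))
        return [h for h, _ in points], [p for _, p in points]

    def partition_for(doc_id, hashes, parts):
        """Pick the first ring point clockwise from the document's hash."""
        idx = bisect.bisect(hashes, _hash(doc_id)) % len(hashes)
        return parts[idx]

    hashes, parts = build_ring(["orders-p0", "orders-p1", "orders-p2"])
    print(partition_for("customer-42", hashes, parts))

    # Adding a fourth partition only remaps the keys that land on its ring
    # points; everything else stays put, so a resize does not force a full reindex.
    hashes, parts = build_ring(["orders-p0", "orders-p1", "orders-p2", "orders-p3"])
    print(partition_for("customer-42", hashes, parts))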
As already said, MongoDB does require partitioning determined by the user. Partitioning and finding the right natural key was one of the major topics at every MongoDB World event I attended. Like ES routing, your MongoDB shard key has significant distribution and performance implications. To scale beyond a single MongoDB node, the same exercise is required: a shard key must be identified and the data must be sharded. This must be defined by the user, the equivalent of ES routing, with the same caveats. The big difference with ES is that the first stage of this happens automatically.
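For reference, here is a hedged sketch of explicit routing through the REST API on a recent ES version (older versions put a type name in the URL; the index, document, and routing value are made up). Documents sharing a routing value land on the same shard, and a query that passes the same value only touches that shard.

    # Index and query with an explicit routing value; all names are illustrative.
    import requests

    ES = "http://localhost:9200"

    # Route the document by customer id so related documents co-locate.
    requests.put(
        f"{ES}/orders/_doc/1",
        params={"routing": "customer-42"},
        json={"customer": "customer-42", "total": 99.50},
    )

    # Search with the same routing value so only the owning shard is hit.
    resp = requests.get(
        f"{ES}/orders/_search",
        params={"routing": "customer-42"},
        json={"query": {"term": {"customer": "customer-42"}}},
    )
    print(resp.json()["hits"]["total"])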
I understand your replica comment and the 2x RAM: keeping 2x the number of shards in memory does require 2x the memory. However, it also improves concurrent query performance. In heavy-query environments it is not uncommon to set the replica count so that every node holds a copy of every shard. The point everyone else was making, I think, is that replicas do not need to be in RAM if the primary is in RAM, so replica=1 does not necessarily mean you need 2x the RAM.
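Since the replica count is a live index setting, scaling it up for a heavy-query period is a single call (the index name and count below are illustrative):

    # Raise the replica count so more nodes hold a copy of every shard.
    import requests

    requests.put(
        "http://localhost:9200/orders/_settings",
        json={"index": {"number_of_replicas": 2}},
    )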
To summarize: the 100-node limit is incorrect; many people are running larger clusters, this is a networking consideration, and there are network infrastructures that will support the traffic of more than 100 nodes. The 31GB-per-shard limit is also incorrect; shard size is only truly limited by your disk, though performance is limited by RAM less HEAP. People are running clusters with tens of thousands of shards, and since an index is just a logical grouping of shards, a single index can scale to tens of thousands of shards as well. It is trivial to scale an ES cluster beyond 3TB; just don't do it in a single index. Partitioning the data is trivial and does not require manual work. Searching across multiple indexes is trivial, and typically automatic.
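As a last sketch of that multi-index search point, a wildcard (or an alias covering the same indexes) lets one query span every partition; the index pattern below is made up.

    # One query across every logs-* partition; an alias would work the same way.
    import requests

    resp = requests.get(
        "http://localhost:9200/logs-*/_search",
        json={"query": {"match_all": {}}, "size": 5},
    )
    print(resp.json()["hits"]["total"])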