A generic question regarding ElasticSearch / Lucene. I have a lot of indexes, and they are very large. I believe we have about 3 billion documents that need to be searched very quickly. The data has a lot of fields (10-20) per document, and documents can range from 1 KB to 64 KB in size. Not all fields are indexed, but a lot of them are stored. I believe we have 5-8 fields that are indexed. We also have some fields that provide term vectors.
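For illustration, the field layout I'm describing looks roughly like the mapping sketch below. The field names are made up, and the exact syntax varies between ElasticSearch versions, but the idea is a mix of indexed fields with term vectors, stored-but-not-indexed fields, and plain indexed fields:

```json
{
  "mappings": {
    "properties": {
      "title":    { "type": "text",    "store": true, "term_vector": "yes" },
      "body":     { "type": "text",    "store": true, "term_vector": "yes" },
      "author":   { "type": "keyword", "store": true },
      "created":  { "type": "date" },
      "raw_blob": { "type": "text",    "index": false, "store": true }
    }
  }
}
```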
Keeping cached results is not very useful. Searches usually do not repeat on the cluster, so caching frequent searches does not help. These indexes together represent 2 years of data, so there is a lot of random access around the shards. When we were using Solr we had a lot of disk access, because we could not warm caches when we did not know what searches were going to be performed. These indexes are for research and the queries are very ad hoc.
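Given that hit rates would be near zero, I'd probably turn query caching off entirely so memory goes to the filesystem cache instead. Something along these lines in the index settings, if I understand the setting correctly (it may be named differently depending on the ES version):

```json
{
  "settings": {
    "index.queries.cache.enabled": false
  }
}
```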
My question is this. Would solid-state disks provide the best result latency, since I'm going to have to fetch the documents from disk no matter what? I'm thinking of a 16-node ElasticSearch cluster with 48 GB of RAM per node, and 6x 300 GB SSDs behind a good RAID controller (LSI 9280, Areca 1880i, etc.). Put the disks in RAID 0 and rely on a good replication factor on the cluster for durability. Would this be a good recommendation? I'm looking for ways to scale the performance of the cluster, and adding a ton of rotating hard disks does not seem like a feasible way to get better query times.
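To make the RAID 0 plan concrete: since a single disk failure takes out a whole node's data, I'd lean on ElasticSearch replicas rather than RAID redundancy. A sketch of index settings under that assumption (shard count here is just a placeholder; one replica means every shard has a second copy on a different node):

```json
{
  "settings": {
    "number_of_shards": 32,
    "number_of_replicas": 1
  }
}
```

With 16 nodes, losing one node's RAID 0 array would then only trigger shard re-replication from the surviving copies, not data loss.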