Optimal memory allocation for a single read-only, single-shard index

Hi - I have a very simple index that I want to optimise for search speed. It is a 23GB read-only index; queries are randomly distributed across all documents (i.e. there are no real hot spots), with the vast majority being standard match searches with filters and the odd geo query (bounding box). I am looking for the best way to fully optimise this for near real-time search. Presumably "index.store.type": "memory" will give a big uplift? Also, I know that a general rule of thumb is to allocate 50% of available RAM to ES and leave 50% for the OS, but would this still be the case for a read-only index?
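For context, this is roughly how I am planning to launch it - the 15g is just half of an r3.xlarge's RAM rather than a tested value, and the tarball layout is an assumption on my part:

```sh
# a sketch: give roughly half the machine's RAM to the Elasticsearch heap
# and leave the rest for the OS file-system cache
export ES_HEAP_SIZE=15g   # ~half of an r3.xlarge's RAM; adjust per instance
./bin/elasticsearch -d
```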

I am looking at AWS r3 instance types, which give the most memory per dollar. There is an r3.xlarge with 30.5GB RAM or an r3.2xlarge with 61GB RAM. Based on the configuration above and the index size of 23GB, which is unlikely to grow, would there be much advantage in going for the r3.2xlarge? They are double the price, so if I can squeeze onto the smaller instance that would be awesome.

Thanks in advance for any advice - I have been trawling the forums and SO but haven't found anything with quite the same requirements.

Don't use this - it's being removed in 2.0 in favour of better options, like RAM disks at the OS level.

Yes - the 50/50 split still makes sense, because the OS can still cache the segment files.

Well, you could theoretically fit the entire index into memory thanks to the OS caching the files.


Thanks for the advice Mark, much appreciated. My OS is Ubuntu and, following advice elsewhere and your recommendation not to use the memory store (I see it is removed in the 2.0 beta), I am using index.store.type: mmapfs. As far as I can see, the only real difference for a read-only index is the refresh interval, as the data is not changing. In theory I guess I could set this to -1 to disable refresh completely, but for now I have increased it from the default 1s to 30s. Does this make sense? Would there be major benefits in making this much longer, say 1d, or disabling it completely?
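Concretely, this is roughly what I am doing at the moment - my_index stands in for my real index name:

```sh
# index.store.type is not a dynamic setting, so it goes in at index creation
# (the single shard matches the setup described above)
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "index.store.type": "mmapfs",
    "index.number_of_shards": 1
  }
}'

# refresh_interval is dynamic, so it can be relaxed after the data is loaded
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "refresh_interval": "30s" }
}'
```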

I have also been reading up on some of the memory advice in the docs, particularly around fielddata. Given that I am not generally performing aggregations, I understand that fielddata will really only be used for sorting and geo queries. Therefore, for regular full-text search fields I have disabled fielddata, and for those used in sorting or geo search I have set fielddata loading to eager, with eager_global_ordinals for the string fields. I also plan on warming my filter cache. There is a natural key in my data (UK postcode) which will be used in the majority of queries, with a finite number of combinations (1.7 million). If I set my refresh frequency low, is there any benefit/harm in setting up warmers to cover all possible options?
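Roughly what that looks like on my side - the field names are placeholders for my real mapping:

```sh
# fielddata off for the pure full-text field, eager global ordinals for the
# postcode field used in sorting/filtering
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "description": {
      "type": "string",
      "fielddata": { "format": "disabled" }
    },
    "postcode": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": { "loading": "eager_global_ordinals" }
    },
    "location": { "type": "geo_point" }
  }
}'

# one example warmer for a single postcode filter - covering all 1.7 million
# combinations this way is what I am unsure about
curl -XPUT 'localhost:9200/my_index/_warmer/postcode_sw1a_1aa' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "postcode": "SW1A 1AA" } }
    }
  }
}'
```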

Thanks again for your reply, Mark

Higher makes more sense, and 30s is fine. I probably wouldn't disable it entirely, as someone could send data and then sit around waiting for it to become searchable, which will never happen while refresh is disabled. You can dynamically change this on a per-index basis too.

No, but if you warm too much you may run out of usable heap!
You'd still be better off using doc values as much as possible and then letting the OS cache those files. Don't forget we will default to doc values for any not_analyzed field in 2.0.
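For example, on a not_analyzed string it's just a flag in the mapping - the field name here is made up, and note it has to be set when the field is first mapped:

```sh
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "postcode": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}'
```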

Thanks Mark - I've just watched the performance webinar, which was very interesting. Chris Earle makes a good case for doc values; however, in my situation, where I have absolute certainty over data volumes and can therefore tune memory usage up front, does it still make sense to use doc values over fielddata? Will fielddata give me better performance for my sorts?

Not sure I am qualified to comment on that :stuck_out_tongue:

But doc values are the way of the future.
