Hi All,
I have an analytics app in development, and I was hoping to get a little advice on configuring Elasticsearch for query performance. Here is the scenario:
In all, there are around 42 million docs in a single index. This number is essentially static; new data may be loaded into it about once a year, so consider it a fixed data set. Indexing speed is unimportant; query speed is the primary goal.
The data is fairly evenly distributed as follows:
GroupId (an integer, mapped as a keyword) - there are about 3,800 unique values, with a pretty even distribution of docs between them. Every query starts by filtering on a single value.
GroupEx (an integer, mapped as a keyword) - there are anywhere from 10 up to about 50 or 60 of these per GroupId (average is around 30). Used sometimes in TERMS aggs, or for sorting; never for filtering.
SubGroup (an integer, mapped as a keyword) - there are 360 of these, repeated for each GroupEx. Every query filters on a range (maximum is 60). Some queries will sort on this, some queries will do TERMS aggs on it.
Each distinct combination of GroupId, GroupEx, and SubGroup is a single document.
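As a sanity check on those numbers, here is a rough back-of-the-envelope calculation in Python. It uses only the approximate figures quoted above, and it reads "maximum is 60" as meaning the widest SubGroup range a query asks for covers 60 values; that reading is my assumption.

```python
# Rough cardinality check using the approximate figures from the post.
GROUP_IDS = 3_800      # unique GroupId values
AVG_GROUP_EX = 30      # average number of GroupEx values per GroupId
SUB_GROUPS = 360       # SubGroup values repeated for each GroupEx
MAX_RANGE = 60         # assumed widest SubGroup range in a single query


def estimated_docs():
    """Total docs if every GroupId had the average number of GroupEx values."""
    return GROUP_IDS * AVG_GROUP_EX * SUB_GROUPS


def docs_per_group():
    """Docs left after the mandatory filter on a single GroupId."""
    return AVG_GROUP_EX * SUB_GROUPS


def docs_after_range_filter():
    """Docs left after also applying the widest assumed SubGroup range."""
    return AVG_GROUP_EX * MAX_RANGE


if __name__ == "__main__":
    print(estimated_docs())           # 41040000, close to the 42 million loaded
    print(docs_per_group())           # 10800
    print(docs_after_range_filter())  # 1800
```

So on average a query should only need to look at a couple of thousand top-level docs once both filters are applied, which is part of why the slowdown surprised me.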
Then there is a nested structure (with a maximum of 24 objects per 'parent'). It consists of a couple of fields that are ints mapped as keywords, which again may be filtered on, sorted by, and used in TERMS aggs, plus half a dozen 'value' fields, each of which is a half_float used in min, max, and avg aggs, as well as some other statistical aggs.
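To make the access pattern concrete, here is a minimal sketch of the shape of a typical request body, built as a plain Python dict. The nested path `Items` and the field `Items.Value1` are placeholder names I have made up for this post, not the real mapping, and the bounds are only illustrative.

```python
import json


def build_query(group_id, sub_group_from, sub_group_to):
    """Build a search body: filter on one GroupId and a SubGroup range,
    then bucket by GroupEx and compute stats over a nested value field."""
    return {
        "size": 0,  # aggregations only, no hits returned
        "query": {
            "bool": {
                "filter": [
                    {"term": {"GroupId": group_id}},
                    # Caveat: on a keyword field, range compares terms as
                    # strings, so these bounds are illustrative only.
                    {"range": {"SubGroup": {"gte": sub_group_from,
                                            "lte": sub_group_to}}},
                ]
            }
        },
        "aggs": {
            "by_group_ex": {
                "terms": {"field": "GroupEx", "size": 60},
                "aggs": {
                    "items": {
                        # "Items" is a hypothetical nested path.
                        "nested": {"path": "Items"},
                        "aggs": {
                            # stats returns min, max, avg, sum and count.
                            "value_stats": {
                                "stats": {"field": "Items.Value1"}
                            }
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_query("17", "100", "159"), indent=2))
```

Other queries swap the TERMS agg to SubGroup or to one of the nested keyword fields, but the filter part is always the same.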
I was running the app locally on my laptop with about a million docs, and it was very quick indeed. I've now put it on Azure with the full load of 42 million, and it is noticeably slower, with some queries timing out.
I've done the basic config, such as setting bootstrap.memory_lock, and allocated half the physical RAM to ES. In terms of indexing, I've just loaded all the docs with no special settings at all; just the basic mapping on the index.
Just wondering (first) whether, given the access patterns above, I should be doing anything with global ordinals, the filesystem cache, or some other settings in the mappings or elsewhere?
Also, it is currently running on a single 4-core, 16 GB Ubuntu VM (it is in demo mode, so I am trying to keep costs low). Would I be better off with, say, two or three smaller VMs?
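For the global ordinals part, this is roughly the kind of mapping change I had in mind, sketched as a Python dict. It assumes a typeless (7.x or later) mapping, and `Items`, `Category` and `Value1` are placeholder names rather than my real fields. The idea is to set `eager_global_ordinals` only on the keyword fields used in TERMS aggs, and leave it off GroupId, which is only ever filtered on. I have not tested whether this helps.

```python
import json


def keyword(eager_ordinals=False):
    """Return a keyword field mapping, optionally building global
    ordinals eagerly at refresh time instead of on the first query."""
    field = {"type": "keyword"}
    if eager_ordinals:
        field["eager_global_ordinals"] = True
    return field


def build_mapping():
    """Sketch of an index mapping for the fields described in the post."""
    return {
        "mappings": {
            "properties": {
                "GroupId": keyword(),                      # filter only
                "GroupEx": keyword(eager_ordinals=True),   # terms aggs, sorting
                "SubGroup": keyword(eager_ordinals=True),  # terms aggs, sorting
                "Items": {                                 # hypothetical nested path
                    "type": "nested",
                    "properties": {
                        "Category": keyword(eager_ordinals=True),
                        "Value1": {"type": "half_float"},
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```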
Any ideas greatly appreciated
Thanks!
Paul