I want to vet my setup/plan with you to check that it is sane and optimal, and
see if you have any ideas or suggestions. I'm not currently using
elasticsearch, so this is my first attempt.
I'd like to index 300 million documents (100 GB) a day and keep this data for
30 days. I mostly need filtered queries sorted by publish date (which could be
several hours behind the indexing date). I'd also like to use percolate, if it
works well at ~3,500 docs/sec.
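For what it's worth, the 3,500/sec figure falls straight out of the daily totals above; a quick back-of-envelope check (all numbers are from my plan, not measurements):

```python
# Sanity-check the ingest rate and average doc size from the daily totals.
docs_per_day = 300_000_000
bytes_per_day = 100 * 1024**3  # ~100 GB

docs_per_sec = docs_per_day / 86_400   # seconds per day
avg_doc_size = bytes_per_day / docs_per_day

print(f"{docs_per_sec:,.0f} docs/sec")   # ~3,472 docs/sec
print(f"{avg_doc_size:,.0f} bytes/doc")  # ~358 bytes/doc average
```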
The plan is to create 6 hour indexes, with 6 shards and 1 replica each.
Data coming in will be added to the appropriate index by its publish date.
Although there will be some stragglers coming in up to a day after they are
published, most of the data will be inserted into the "newest" 2-3 indexes
(the latest 12 hours).
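The routing of a document to its 6-hour index is simple enough to sketch; the index naming scheme here is hypothetical, just to make the bucketing concrete:

```python
from datetime import datetime, timezone

def index_for(publish_date: datetime) -> str:
    """Map a publish date to its 6-hour index, e.g. docs-2024.01.15-12.

    Buckets start at hours 00, 06, 12, and 18; the naming scheme is
    an assumption, not an elasticsearch convention.
    """
    bucket = (publish_date.hour // 6) * 6
    return f"docs-{publish_date:%Y.%m.%d}-{bucket:02d}"

print(index_for(datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)))
# docs-2024.01.15-12
```

Stragglers that arrive a day late simply route to an older (already file-based) index by the same function.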
I was thinking we'd keep the latest 2-3 indexes memory-based. Once we roll
onto a new 6-hour index, we would update the settings in real time on the
oldest of the 3 memory-based indexes to make it file-based. I'd expect the
memory-based indexes to help with the high insert rate, but I'm unsure
whether I can convert a memory-based index to a file-based index in real
time, and what the performance implications would be.
Every 6 hours, after creating a new index, we'll drop/delete the oldest
(121st) index, which keeps us at 120 indexes / 30 days. We'll also batch
inserts into ~1 MB batches of docs. Some of the documents are larger and many
are smaller, so batching by size should be more consistent than bulking by a
fixed number of documents.
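The size-based batching I have in mind looks roughly like this (a sketch only; it measures serialized JSON length as a stand-in for the bulk payload size, and doesn't include the bulk action lines):

```python
import json

def batches(docs, max_bytes=1_000_000):
    """Group docs into batches of roughly max_bytes of serialized JSON,
    rather than a fixed document count, since doc sizes vary a lot.
    A single oversized doc still goes out in its own batch."""
    batch, size = [], 0
    for doc in docs:
        line_len = len(json.dumps(doc))
        if batch and size + line_len > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += line_len
    if batch:
        yield batch
```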
I'm going to run a test cluster of 3-6 high-memory 4XL instances (68 GB
memory, 1,700 GB storage each).
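As a rough storage sanity check on that range, using only the numbers above (this ignores merge overhead, translog, and index bloat, so real usage will be higher):

```python
# Rough capacity check: raw data * retention * (primaries + replicas),
# spread across the candidate node counts. Assumptions only.
daily_gb = 100
retention_days = 30
replicas = 1

total_gb = daily_gb * retention_days * (1 + replicas)  # 6,000 GB
for nodes in (3, 6):
    print(f"{nodes} nodes: {total_gb / nodes:,.0f} GB each (of 1,700 GB)")
```

At the 3-node end this works out to ~2,000 GB per node for replicated raw data alone, which is already over the 1,700 GB of storage, so the low end of that range looks tight to me.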
Do you think this sounds like a good way to tackle this?
Is there anything we should do when we bulk load the initial 30 days of data
(turn off replicas, refreshes/commits, or something)?
Can you convert a memory store to a file store in real time?
Does bulking in ~1 MB batches sound reasonable? Should that be more or less?
Any other problems or optimizations you can see?
I figured this is not a new problem and there are others who can explain
how this should be done. So thank you for your time!