ES noob here. I'm using ES as our main storage but facing a dilemma. We store service data up to 36 hours, our first design is to make hourly indices, so the whole index could be deleted once it's expired. Each service has a unique ID and both new and existing services would be written/updated to ES every 10 minutes.
The problem is that if a service exists across the hours, say index1 and index2 and we ask for 2 hours of data, the query would return duplicates coming from both indices. Our query is paginated so we are not fetching all records, which makes the dedupe impossible. We created a workaround that adds a flag to the docs in previous index if current index has the same data, and excludes them when we query, but we ended up writing twice as much(once for updating the flag in the previous index and once for creating new doc in current index). This causes CPU load to spike and generates massive delay on our app.
We also prototyped another approach, using one index only, with a cronjob that deletes data older than 36 hours. But the query performance is much(10x) worse than hourly index. I figured because the request would propagate to all shards/nodes and the overhead is huge. We also thought about creating custom routing by user ID so the queries don't hit all shards, but fear this might cause extremely unbalanced shards. Routing by time is also not an option because the "modified" time is updated every 10 minutes.
My setup looks like this: 2 master nodes and 10 data nodes, with ec2 i3.4xlarge instances https://www.ec2instances.info/?selected=i3.4xlarge. 2TB of data for 36 hours with 1/2 billion docs, peak write rate at 50k/s. 20 shards 1 replica which each shard ended up holding 50GB of data.
My question would be:
- Is there a built in way to dedupe by a field across indices for a query(nothing that I found)?
- If not, would more shards make a difference for a query? Current shard size looks pretty big, but more shards means more propagation of a search request/more query contentions per node, not sure what would the overhead look like.
- For single index, would the performance be the same to have twice as many nodes but half of the size each? Even each node holds less shards, but less powerful.
I hope I made myself clear, any suggestions would be helpful. Thanks.