ES design regarding duplicates across indexes

Hi all:

ES noob here. I'm using ES as our main storage but am facing a dilemma. We store service data for up to 36 hours, and our first design was to use hourly indices so a whole index can be deleted once it expires. Each service has a unique ID, and both new and existing services are written/updated to ES every 10 minutes.

The problem is that if a service exists across hours, say in index1 and index2, and we ask for 2 hours of data, the query returns duplicates from both indices. Our query is paginated, so we are not fetching all records, which makes deduping on the client side impossible. We built a workaround that adds a flag to the docs in the previous index if the current index has the same data, and excludes flagged docs when we query, but we ended up writing twice as much (once to update the flag in the previous index and once to create the new doc in the current index). This causes CPU load to spike and introduces massive delays in our app.

We also prototyped another approach: a single index with a cron job that deletes data older than 36 hours. But query performance is much (roughly 10x) worse than with hourly indices. I figure this is because each request has to fan out to all shards/nodes and the overhead is huge. We also considered custom routing by user ID so queries don't hit all shards, but we fear this might cause extremely unbalanced shards. Routing by time is also not an option because the "modified" time is updated every 10 minutes.
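For context, the custom routing we had in mind would look something like this (a rough sketch; the `services` index and the `user123` routing value are hypothetical):

```
# Index/update with a routing value so the doc lands on a single shard
PUT /services/_doc/service-42?routing=user123
{
  "service_id": "service-42",
  "user_id": "user123",
  "status": "running",
  "modified": "2018-06-01T10:30:00Z"
}

# Search with the same routing value so only that shard is hit
GET /services/_search?routing=user123
{
  "query": { "term": { "user_id": "user123" } }
}
```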

My setup looks like this: 2 master nodes and 10 data nodes, all EC2 i3.4xlarge instances (https://www.ec2instances.info/?selected=i3.4xlarge). Roughly 2 TB of data for 36 hours, about half a billion docs, and a peak write rate of 50k/s. 20 shards with 1 replica, and each shard ends up holding about 50 GB of data.

My questions are:

  1. Is there a built-in way to dedupe by a field across indices in a query (I couldn't find one)?
  2. If not, would more shards make a difference for a query? The current shard size looks pretty big, but more shards also means a search request fans out to more shards and there is more query contention per node; I'm not sure what the overhead would look like.
  3. For a single index, would performance be about the same with twice as many nodes at half the size each? Each node would hold fewer shards, but each node would also be less powerful.

I hope I've made myself clear; any suggestions would be helpful. Thanks.

Using time-based indices is best.

Can you elaborate more on your document structure? If this is time-based data then usually you'd have an event at $time, and then an event at $time+n, and they would both have different values. Is that not the case?

Two master nodes is bad (you want an odd number of master-eligible nodes so you can form a proper quorum and avoid split brain); see Important Configuration Changes | Elasticsearch: The Definitive Guide [2.x] | Elastic

So if I understand correctly, each event either creates a new service doc or updates an existing one.
I can think of a few approaches:

  1. Using time-based indices:
    a) de-dup on query
    b) de-dup on index
  2. A single index for services

1a can be achieved using things like field collapsing for search or some other trickery when performing aggregations (we'd need a follow-up discussion on what sorts of queries you need to perform)
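As a rough sketch of 1a, a collapsed search over the hourly indices might look something like this (the `services-*` index pattern, the `service_id` keyword field, and the `modified` date field are assumptions about your mappings):

```
GET /services-*/_search
{
  "query": {
    "range": { "modified": { "gte": "now-2h" } }
  },
  "collapse": { "field": "service_id" },
  "sort": [ { "modified": { "order": "desc" } } ],
  "from": 0,
  "size": 20
}
```

Collapsing keeps only the top hit per `service_id` according to the sort, so the latest copy of a service wins even when older copies live in another index. One caveat: the total hit count in the response still reflects the uncollapsed matches.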

1b could possibly be achieved with some index-routing if the service update event your client app receives happens to include the start time of that service as well as the ID (perhaps the time is included in the ID?)
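A sketch of what 1b could look like on the write path, assuming the update event carries the service's start time (index naming and field names here are made up): every write for a service goes to the hourly index derived from its start time, never from the update time, so only one copy ever exists.

```
# Service started in the 08:00 hour, so even an update arriving at 10:30
# goes back to the 08:00 index
PUT /services-2018-06-01-08/_doc/service-42
{
  "service_id": "service-42",
  "started": "2018-06-01T08:05:00Z",
  "modified": "2018-06-01T10:30:00Z",
  "status": "running"
}
```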

With 2) you can prune expired services using delete by query but this will be more expensive for Lucene to manage.
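For completeness, the delete-by-query pruning for option 2 would be along these lines (a single `services` index and a `modified` timestamp field are assumed):

```
POST /services/_delete_by_query
{
  "query": {
    "range": { "modified": { "lt": "now-36h" } }
  }
}
```

This marks documents as deleted rather than dropping whole segments, so space is only reclaimed as segments merge, which is why it is more expensive than simply dropping an expired hourly index.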

Hey Mark:

Thanks for the suggestion, and sorry for not making myself clear. As the other Mark (the second reply) says, we are not using the usual time-based approach. Each event either creates a new doc or updates an existing doc; all docs are meant to track the latest status of a service. We do this because we don't want the same service to appear in 100 docs with only minor changes in the fields.

The problem with our hourly-index approach is that once the hour flips we still write existing services to the new index, so when the query timeframe crosses both indices you end up getting duplicates.

Does that sound clearer to you? Please let me know if you need more details. Again thanks for helping out, cheers.

I understand; I call this entity-centric indexing [1]. The entity in your case is a "service" and it goes through a life cycle marked by a series of events. The challenge you are facing is that you are trying to use a time-based indexing pattern designed for append-only time-based events to maintain entities that span time divisions.
You need to think about the complexity of the properties you hope to maintain as new events arrive. Update requests which apply new events to the entity can change simple properties (e.g. currentStatus) or more complex derived properties (e.g. duration: the last timestamp minus the first-seen timestamp).
Simple properties don't necessarily require an entity-centric index - for example, it may be possible to query an event-centric set of indices and filter them at query time to find the latest doc for each service ID. This is option 1a in my example. If you have complex properties (e.g. my duration example) then this logic is easier and more efficient to perform using an entity-centric index, which is my option 2. The problem with option 2 is that ageing out old content is harder.
Option 1b is kind of a halfway house. You have the content-ageing benefits of using time-based indices but need to route updates to the time-based index where you maintain the only copy of a service entity, based on its first insertion date rather than the last update date.
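As a rough illustration of the entity-centric update path (option 2), a scripted upsert can maintain the first-seen/last-seen pair that a derived property like duration is computed from. The index name, document ID, and field names are made up, and the exact endpoint form depends on your Elasticsearch version:

```
POST /services/_update/service-42
{
  "scripted_upsert": true,
  "script": {
    "source": "if (ctx._source.first_seen == null) { ctx._source.first_seen = params.now } ctx._source.last_seen = params.now; ctx._source.status = params.status;",
    "params": {
      "now": "2018-06-01T10:30:00Z",
      "status": "running"
    }
  },
  "upsert": {}
}
```

Every event reuses the same document ID, so the service only ever exists once; duration can then be derived from first_seen and last_seen at query time or added inside the script.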

[1] https://www.youtube.com/watch?v=yBf7oeJKH2Y


Hey Mark (Harwood):

Sorry for the confusion: there are two repliers in this thread, both named Mark, and I was replying to the other Mark (Mark Walkom).

I see your point, and your suggestions make a lot of sense. I'm testing out the field collapsing functionality you mentioned to see if we benefit from it. Our queries are pretty simple, mostly paginated searches. I'll get back to you with more questions.

You wouldn't believe how often this sort of thing happens 🙂

Neither would you 😛


I ran into this same problem using weekly indices. We came up with two alternatives (aside from a single index with delete-by-query cron jobs).

  1. Route to one of two rolling indices based on creation timestamp (when the entity is first created), and when searching, search both indices. With this option, the entity will always be in only one index and will have the correct information/updates. Searching both indices is easily doable with aliases (see the sketch after this list).

[edit] assuming you do things in bulk, this will result in twice as many bulk requests (one to each index/alias), but the total number of index/update operations inside those bulks will stay the same.

  2. Overlap your rolling indices such that you're always writing to two indices but only reading from the older one, as it has more data. Since you are only searching one index, you will not get duplicates.

[edit] this will require twice the number of bulk requests and twice the number of index requests
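A rough sketch of the alias setup for option 1 (the index and alias names here are made up):

```
POST /_aliases
{
  "actions": [
    { "add": { "index": "services-a", "alias": "services-search" } },
    { "add": { "index": "services-b", "alias": "services-search" } }
  ]
}

# Queries go through the alias and fan out to both backing indices,
# but each service lives in exactly one of them, so no duplicates
GET /services-search/_search
{
  "query": { "term": { "service_id": "service-42" } }
}
```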

I personally think #1 is better.

Let me know if these solutions work for you or if you need further details.

[Edit] Also, regarding your current workaround: if you are having CPU spikes because you have to write twice as much, consider adding more data nodes. Right now you have 10 data nodes for 40 shards (20 primary, 20 replica), i.e. 4 shards per node. Consider increasing to 20 or 40 nodes.

Also, for #2, consider adding more replicas. A search query only has to hit one copy of each shard, so by increasing availability you can speed up search queries. The downside is that indexing/updating may be slower. Weigh this trade-off depending on whether you value indexing or searching more.
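Bumping the replica count is a live settings change, for example (index pattern assumed):

```
PUT /services-*/_settings
{
  "index": { "number_of_replicas": 2 }
}
```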
