Is Elasticsearch the wrong technology stack for our use case?

Hey everyone,

I'm facing a technology stack decision for an upcoming feature, and I'm wondering whether Elasticsearch is the right choice.

My worst-case scenario would involve around 500 million parent documents, with ~1.5 billion child documents added to the cluster daily (each containing two big integers and a datetime). I would need to retain at least 30 days' worth of data, and there will be heavy aggregations over children based on parent values (e.g. "give me all children that did x whose parent has been calculated by y since date z").
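To make the model concrete, here is a rough sketch of how such a parent/child setup might look, assuming a recent Elasticsearch with the `join` field (older versions used the `_parent` mapping instead); index name, field names, and values are illustrative, not our actual schema:

```
PUT /events
{
  "mappings": {
    "properties": {
      "relation": { "type": "join", "relations": { "parent": "child" } },
      "value_a":  { "type": "long" },
      "value_b":  { "type": "long" },
      "created":  { "type": "date" }
    }
  }
}

# children must be routed to their parent's shard
PUT /events/_doc/c1?routing=p1
{
  "relation": { "name": "child", "parent": "p1" },
  "value_a": 123,
  "value_b": 456,
  "created": "2017-06-01T08:00:00Z"
}
```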

We currently have a smaller-scale version of this system running on a cluster of 23 20-core machines dedicated to the index containing the data: with 75 shards and one replica, we have 5.4 billion documents at a total size of 1.2 TB, and it's performing... well... not as well as we hoped.

The queries take two to seven seconds - thanks to the parent-child joins - and denormalizing the data is not an option, as we constantly ingest around 2k docs per second, which would result in huge write amplification. So in its current state this will not scale... at all. I'm quite fond of Elasticsearch and willing to push its limits, but apparently parent-child is not exactly a strong suit of ES (sadly!).
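For reference, the slow query shape looks roughly like this (field names are placeholders, following the join-field sketch above): fetch children that did x within the last 30 days, restricted by a has_parent clause that runs the parent filter on every shard:

```
GET /events/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "action": "x" } },
        { "range": { "created": { "gte": "now-30d/d" } } },
        {
          "has_parent": {
            "parent_type": "parent",
            "query": { "term": { "calculated_by": "y" } }
          }
        }
      ]
    }
  }
}
```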

Does anyone have a recommendation for a technology stack that could be a lifesaver for us? Would a PostgreSQL cluster perhaps be a better fit for this use case? Maybe Cloud Spanner?

Any feedback/ideas welcome! :slight_smile:

Why can't you flatten things?

Parent-child is especially useful if your parents get updated often enough to make maintaining a flattened model impractical, either due to update frequency or the number of children per parent. For time-based data, a flattened model is usually preferable, as it lets you efficiently phase out whole indices of expired data rather than delete individual documents from a large index, which is considerably more resource-intensive.
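As a sketch of what that looks like in practice (index names and dates are illustrative): write children to time-based indices and expire data by dropping whole indices:

```
# children go into daily indices rather than one monolithic index
PUT /children-2017.06.01/_doc/1
{ "parent_id": "p1", "value_a": 1, "value_b": 2, "created": "2017-06-01T08:00:00Z" }

# 30-day retention then becomes a cheap whole-index drop
DELETE /children-2017.05.02
```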

The parent documents will be relatively static (each gets updated maybe once every few weeks); the child documents are completely static.

Flattening the children into the parent would result in a full rewrite of the parent document every time a new child is added, which would happen multiple times per day for every parent (almost every parent will receive around 20 children per day, some even hundreds). I doubt the SSDs would be happy with that, at least not for long, and the constant reindexing would be a massive drain on system resources.
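To illustrate: a flattened model would have to embed the children inside the parent, e.g. as a `nested` array (mapping and field names are assumptions for illustration):

```
PUT /parents-flat
{
  "mappings": {
    "properties": {
      "calculated_by": { "type": "keyword" },
      "children": {
        "type": "nested",
        "properties": {
          "value_a": { "type": "long" },
          "value_b": { "type": "long" },
          "created": { "type": "date" }
        }
      }
    }
  }
}
```

Since Lucene documents are immutable, appending a single child means fetching, modifying, and reindexing the entire parent document, existing children included, and those documents only grow over time.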

We could denormalize the parent information into the children, but updating all children every time the parent changes also results in a good amount of writes, and the space requirements suddenly grow by a huge factor...
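Roughly, every parent change would then fan out into something like this (assuming a denormalized parent_state field copied onto each child; names are made up):

```
POST /children/_update_by_query
{
  "query": { "term": { "parent_id": "p1" } },
  "script": {
    "source": "ctx._source.parent_state = params.state",
    "params": { "state": "recalculated" }
  }
}
```

With 20+ children per parent per day accumulating over 30 days, a single parent update turns into hundreds of child rewrites.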

Okay, we have come to a conclusion: we'll have to migrate one of our core features away from Elasticsearch.

This decision is driven by two main reasons:

  • No sub-queries. The performance of has_child is awful when lots of parents match the parent filters. In our case, around 6 million parents (out of roughly 60 million) have to be joined against all of their children just to find the 85 thousand entries matching the child filter criteria (the query shape is sketched after this list).
    We can't go the other way around, because we require aggregations over the parent documents. And fetching hundreds of thousands of IDs and sending them back via a terms query to avoid the parent-child join is also not an option.

  • No way to enforce a filter execution order. We know our data, and watching shards pick the worst-case execution order and thereby degrade query/node throughput is concerning, to say the least.
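For completeness, the problematic query shape from the first bullet looks roughly like this (field names and the terms aggregation are illustrative, not our real schema):

```
GET /parents/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "calculated_by": "y" } },
        {
          "has_child": {
            "type": "child",
            "query": {
              "bool": {
                "filter": [
                  { "term":  { "action": "x" } },
                  { "range": { "created": { "gte": "now-30d/d" } } }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "parents_by_group": { "terms": { "field": "group_id" } }
  }
}
```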

Are there any plans to address these issues in the near future?

Changing the database technology and adding another cluster to our infrastructure stack is not something I'm looking forward to... :neutral_face:
