Environment: 4 nodes, 124GB RAM total, ~.5TB of data, v5.6.8
Indices: 1 main index, 1 parent doc type, 3 child doc types, ~300 index operations/s on the primary shards
Clients: Primarily the Java SDK
Issue Symptoms
Our core index has a parent doc type (elements) and a secondary core type (metrics) which is a child document (linked by a primary key). We have other child doc types, but one is enough to illustrate the issue. Metrics are also embedded in the element docs, but they are duplicated into their own doc type so we can query and fetch metrics matching criteria, not just elements with metrics matching criteria.
We've found that metrics in the child doc type go missing. I've done a lot of research into this to try to reproduce it. The two scenarios I've tried to reproduce are:
Are we getting bulk index failures when saving metrics? I log them and can't find a single one. I can replicate errors in a test environment, but that's through maxing out the bulk index thread pool.
Are metrics getting indexed on a primary shard, not yet replicated to the replica shards, and then lost when the primary fails over? I've tried to reproduce this as well with a small Dockerized cluster, without success.
Questions
I've got two main lines of questions:
Does anyone have any advice on other things to try? Have you seen a scenario like this before? Is the Java SDK lying to me about bulk index errors? Is nothing possibly wrong with Elasticsearch and I need to dig into my app code more?
I understand duplicating data like this is not great. I don't like it. Do you know of a better way to model the data so I can pull back child documents from queries, but only storing them as embedded documents?
Thanks in advance, I'm happy to provide more detail if needed.
When you ask whether it is lying about bulk indexing errors, which errors are you talking about?
Something to check is whether you are indexing any of the parent documents with custom routing values. If you index the parent type with a custom routing value and then use the parent's _id (instead of that routing value) when indexing the child type, the parent and child will end up in different shards and the has_child query won't find the document.
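To make the routing point concrete, here is a minimal sketch of how a parent and its child can land on different shards. It assumes the documented rule that the shard is chosen as hash(_routing) % number_of_primary_shards, and that a child indexed with only a parent value is routed by that parent ID. Elasticsearch uses Murmur3 for the hash; CRC32 is swapped in here purely to keep the sketch self-contained, so the shard numbers it prints will not match a real cluster. The IDs, the routing value "tenant-a", and the shard count are all hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumption: substitute the index's real number_of_shards


def shard_for(routing, num_shards=NUM_PRIMARY_SHARDS):
    """Stand-in for hash(_routing) % num_primary_shards.

    Elasticsearch uses Murmur3; CRC32 is used here only for illustration.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def effective_routing(doc_id, parent=None, routing=None):
    """Routing value that decides the shard for one index request."""
    if routing is not None:   # an explicit routing value always wins
        return routing
    if parent is not None:    # otherwise a child is routed by its parent ID
        return parent
    return doc_id             # otherwise the document's own _id is used


# Parent indexed with a custom routing value (hypothetical names).
parent_shard = shard_for(effective_routing("element-1", routing="tenant-a"))

# Child indexed with parent=element-1 only: routed by "element-1",
# which may hash to a different shard than "tenant-a".
child_shard_parent_only = shard_for(effective_routing("metric-1", parent="element-1"))

# Child indexed with the same routing value as the parent: always the same shard.
child_shard_same_routing = shard_for(
    effective_routing("metric-1", parent="element-1", routing="tenant-a")
)

print("parent shard:", parent_shard)
print("child shard (parent only):", child_shard_parent_only)
print("child shard (same routing):", child_shard_same_routing)
```

If the first two numbers differ for any of your documents, that child is stored but invisible to has_child queries, which can look like a missing document.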
I understand duplicating data like this is not great. I don't like it. Do you know of a better way to model the data so I can pull back child documents from queries, but only storing them as embedded documents?
You may be able to return the documents that you want without duplicating them into the parent document by using the inner_hits parameter; see the "Inner hits" page in the Elasticsearch Guide [6.4].
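As a rough illustration of the inner_hits suggestion, the sketch below builds a search body that returns the matching embedded metrics alongside each matching element. It assumes the metrics are mapped as a `nested` field named `metrics` on the element type; the field `metrics.name` and the value `cpu.load` are hypothetical. If the metrics stay as a child type instead, the same `inner_hits` key can be placed inside a `has_child` query.

```python
import json


def nested_inner_hits_query(path, field, value, size=3):
    """Build a search body: match parents by a nested field, return the matching nested docs."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"term": {f"{path}.{field}": value}},
                # inner_hits asks for the nested docs that caused each parent to match
                "inner_hits": {"size": size},
            }
        }
    }


# Hypothetical path, field name and value, for illustration only.
body = nested_inner_hits_query("metrics", "name", "cpu.load")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request from any client, including the Java one.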
Thanks for your response. I'm not sure exactly which errors could be getting thrown, but when I overload the bulk indexing thread pool I do get those errors as part of the bulk response. My question is whether there are other thread pools getting overloaded (or something else) that cause the documents not to be saved without returning an error in the bulk index response.
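For reference, a bulk request can return HTTP 200 overall while individual items fail, so each item has to be inspected. The sketch below walks a parsed bulk response and collects the failed items. The response shape (a top-level `errors` flag and an `items` list with one action key per entry, each carrying `status` and optionally `error`) is the documented bulk format; the sample IDs and the rejection reason text are made up for illustration.

```python
def failed_bulk_items(bulk_response):
    """Return one entry per bulk item that reports an error or a non-2xx status."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item holds a single key naming the action: index, create, update or delete.
        for action, result in item.items():
            status = result.get("status", 200)
            if "error" in result or status >= 300:
                failures.append({
                    "action": action,
                    "_id": result.get("_id"),
                    "status": status,
                    "error": result.get("error"),
                })
    return failures


# Hypothetical response: one success, one thread pool rejection.
sample_response = {
    "took": 12,
    "errors": True,
    "items": [
        {"index": {"_index": "core", "_type": "metrics", "_id": "metric-1", "status": 201}},
        {"index": {"_index": "core", "_type": "metrics", "_id": "metric-2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception",
                             "reason": "rejected execution (sample text)"}}},
    ],
}

failures = failed_bulk_items(sample_response)
for failure in failures:
    print(failure["_id"], failure["status"], failure["error"]["type"])
```

Logging the output of a check like this for every bulk call, rather than only the top-level flag, makes it easier to rule the bulk path in or out.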
Thanks for the inner hits recommendation, I'll look into it more. The four core queries we make are:
Search for matching parent documents (trivial with a single index)
Search for matching nested documents (looks to be possible with the inner_hits parameter)
Aggregate matching parent documents (also trivial with a single index)
Aggregate matching nested documents (TBD)
If we can search for and aggregate nested documents then perhaps all my troubles are solved.
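For the fourth case, aggregating over matching nested documents, one possible approach is a `nested` aggregation wrapping a `filter` aggregation. The sketch below builds such a request body. It assumes the metrics are mapped as a `nested` field named `metrics`; the field names `metrics.name` and `metrics.value`, the aggregation names, and the choice of an average are hypothetical.

```python
import json


def nested_metric_agg(path, filter_field, filter_value, value_field):
    """Build a body that aggregates only the nested docs matching a filter."""
    return {
        "size": 0,  # only the aggregation result is needed, not the parent hits
        "aggs": {
            "all_metrics": {
                # step into the nested documents under `path`
                "nested": {"path": path},
                "aggs": {
                    "matching_metrics": {
                        # keep only the nested docs that match the criteria
                        "filter": {"term": {f"{path}.{filter_field}": filter_value}},
                        "aggs": {
                            "avg_value": {"avg": {"field": f"{path}.{value_field}"}}
                        },
                    }
                },
            }
        },
    }


# Hypothetical field names and value, for illustration only.
agg_body = nested_metric_agg("metrics", "name", "cpu.load", "value")
print(json.dumps(agg_body, indent=2))
```

Adding a top-level `query` to the same body would restrict the aggregation to the nested documents of matching parents.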