Elasticsearch 6.x Join behaviour

Hello

I'm reading the v6.x documentation in preparation for a later migration from 2.4/5.5 to 6.x.

In some indices we currently have multiple types (already something to change) and parent-child relations between different types in the same index...

Since indices will no longer support multiple types, and parent/child is changing, I stumbled upon this documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.x/parent-join.html

One line that puzzles me (yes, it's the first line...) is: "The join datatype is a special field that creates parent/child relation within documents of the same index."

This does not look like a replacement for the current parent-child relation, which supports different types. And because different types can no longer be held in the same index, does that mean I can't use parent-child between different types anymore?

Or am I missing something?

Hello,
Elasticsearch 6.x will only allow you to have a single document type per index.
The new join datatype will allow you to still model parent-child relationships in your data with a single document type.

What does this mean for you?
You will have to migrate to a single type, but you will not lose any functionality: the new join field still supports all the queries you can run right now.
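
As a rough illustration of what a single-type mapping with a join field could look like, here is a small Python sketch that only builds the request body (it does not talk to a cluster). The type name `doc`, the field name `my_join_field` and the `question`/`answer` relation are assumptions chosen for the example, not names from your indices.

```python
import json


def build_join_mapping(type_name, join_field, parent, child):
    """Return a 6.x-style index body: one type, one join field, one relation."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    join_field: {
                        "type": "join",
                        # parent relation name -> child relation name
                        "relations": {parent: child},
                    }
                }
            }
        }
    }


# All names below are hypothetical placeholders.
mapping = build_join_mapping("doc", "my_join_field", "question", "answer")
print(json.dumps(mapping, indent=2))
```

The resulting JSON would be sent as the body when creating the index; parents and children then live in the same type and are told apart by the value of the join field.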

Does this answer your question?

Luca

Hello,
Yes, I understand the suggested approach (single index, single type, merging the fields of the types that currently have parent-child relations).

It will allow us to create the fields of all "logical" types even if we use strict mapping.

Are there any plans for joins between documents in different indices? (Yes, I'm aware that the current parent-child works because the documents are co-located on the same shard via routing, and that joining across different indices is significantly harder in a distributed parallel system :) )

In the past there was a plugin (siren join) that provided this functionality, but it is currently unsupported (its developers have concentrated on a wider commercial solution).

Query-time joins across networks will always be expensive, so we don't support them.

If I recall correctly their approach to scaling joins with large numbers of IDs was to save space by sending hashes rather than full ID strings and join only using those. In Java this would be the equivalent of a HashMap based on objects that implemented hashCode() but not equals() - it would be fast but (scarily) you have the potential for false positives.
Sending a lot of data over a network is slow and physics is a tough thing to beat.
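
To make the false-positive risk concrete, here is a minimal Python sketch of a join that ships only hashes. It is not the plugin's actual implementation: `toy_hash` is a deliberately weak hash invented for the demo so that a collision is easy to see, and the document IDs are made up. A real system would use a much wider hash, which makes collisions rare but not impossible.

```python
def toy_hash(doc_id: str) -> int:
    # Deliberately weak and order-insensitive (demo assumption):
    # IDs made of the same characters always collide.
    return sum(doc_id.encode("utf-8")) % 256


def exact_join(left_ids, right_ids):
    # Reference behaviour: compare full ID strings.
    left = set(left_ids)
    return [r for r in right_ids if r in left]


def hash_only_join(left_ids, right_ids):
    # Only the hashes are "sent over the network"; the receiving side
    # never sees the full IDs, so it cannot check for equality.
    shipped = {toy_hash(i) for i in left_ids}
    return [r for r in right_ids if toy_hash(r) in shipped]


left = ["doc-12", "doc-34"]
right = ["doc-12", "doc-21", "doc-99"]

exact = exact_join(left, right)
approx = hash_only_join(left, right)
false_positives = [r for r in approx if r not in exact]

print(exact)            # ['doc-12']
print(approx)           # ['doc-12', 'doc-21']
print(false_positives)  # ['doc-21']
```

The hash-only join never misses a true match, but `doc-21` is reported as joined even though it is not in the left-hand set.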

My concern is along similar lines: types allowed some sort of abstraction in cases of conflicts.

Let me elaborate a bit, based on the following unanswered question of mine linked below:


Assume that we store Stack Overflow data in an index. The appropriate minimal schema would be:

Question:

  • Title
  • Content

Answers:

  • Content

In JSON, this would look something like:

{
    "mappings":{
        "question": {
            // Title, content mappings
        },
        "answer": {
            "_parent": {
                "type": "question"
            }
        }
    }
}

So this allowed me to write separate non-conflicting docs to each type:

curl -XPUT 'localhost:9200/so_index/question/1?routing=1' -d '{"title": "..", "content": ".."}'
curl -XPUT 'localhost:9200/so_index/answer/1?routing=1&parent=1' -d '{"content": ".."}'

So with the new implementation, not only am I forced to map the "child type" to a parent (which will be removed in ES 7, of course), I also have to make sure that _id 1 of the parent does not conflict with _id 1 of the child. When writing a child as shown below, it would actually overwrite the parent if the _id is the same.

# Writing PARENT
curl -XPUT 'localhost:9200/so_index/question/1?routing=1' -H 'Content-Type: application/json' -d '{"title": "..", "content": "..", "join_type": { "name": "question" }}'

# Writing CHILD.
# Passing the parent "type" when writing the child.
# This would actually overwrite the parent, because both use _id 1.
curl -XPUT 'localhost:9200/so_index/question/1?routing=1' -H 'Content-Type: application/json' -d '{"content": "..", "join_type": { "name": "answer", "parent": 1 }}'

In such a case, is there any provision or setting to prevent this from happening? Or is it up to the client to ensure that it sends non-conflicting IDs?
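
If it does fall to the client, one possible workaround is to namespace each _id with its relation name before indexing. The Python sketch below only builds the request path and body; the helper names `namespaced_id` and `build_index_request` and the `relation-id` format are hypothetical, and `join_type` is the join field name from my example above.

```python
from urllib.parse import quote


def namespaced_id(relation, local_id):
    # "question" + 1 -> "question-1", so a parent and a child that share
    # a local ID no longer map to the same _id.
    return f"{relation}-{local_id}"


def build_index_request(index, doc_type, relation, local_id, source, parent=None):
    """Return (path, body) for indexing one document of a join relation.

    parent is an optional (relation, local_id) tuple for child documents.
    """
    doc_id = namespaced_id(relation, local_id)
    join = {"name": relation}
    routing = doc_id
    if parent is not None:
        parent_id = namespaced_id(*parent)
        join["parent"] = parent_id
        # Children are routed by the parent's ID so both land on one shard.
        routing = parent_id
    body = dict(source, join_type=join)
    path = f"/{index}/{doc_type}/{quote(doc_id)}?routing={quote(routing)}"
    return path, body


q_path, q_body = build_index_request(
    "so_index", "question", "question", 1, {"title": "..", "content": ".."})
a_path, a_body = build_index_request(
    "so_index", "question", "answer", 1, {"content": ".."}, parent=("question", 1))

print(q_path)  # /so_index/question/question-1?routing=question-1
print(a_path)  # /so_index/question/answer-1?routing=question-1
```

With this scheme the question and the answer both keep local ID 1 but are stored under different _id values, so the child no longer overwrites the parent.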

Thank you.
We will remove the parent-child relations (which we currently use as a replacement for joins) and do the joins in the application.
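
A minimal sketch of what such an application-side join might look like in Python, assuming the parents and children have already been fetched with two separate queries. The field names `id` and `question_id` and the sample documents are hypothetical.

```python
from collections import defaultdict


def join_in_application(questions, answers, parent_key="question_id"):
    """Attach each answer to its question using a plain in-memory lookup."""
    answers_by_parent = defaultdict(list)
    for answer in answers:
        answers_by_parent[answer[parent_key]].append(answer)
    # Copy each question and add its (possibly empty) list of answers.
    return [dict(q, answers=answers_by_parent.get(q["id"], [])) for q in questions]


# Stand-ins for the hits returned by two separate searches.
questions = [{"id": "q1", "title": "A"}, {"id": "q2", "title": "B"}]
answers = [
    {"id": "a1", "question_id": "q1", "content": "x"},
    {"id": "a2", "question_id": "q1", "content": "y"},
]

joined = join_in_application(questions, answers)
print(joined)
```

The trade-off is an extra round trip and more data held in the application, in exchange for not depending on parent-child at all.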
