Is there any solution with elasticsearch parent-child join

wallellen · January 2, 2019, 11:36pm

I have an Elasticsearch index with the following mapping:

    PUT /test
    {
      "mappings": {
        "doc": {
          "properties": {
            "status": {
              "type": "keyword"
            },
            "counting": {
              "type": "integer"
            },
            "join": {
              "type": "join",
              "relations": {
                "vsim": ["pool", "package"]
              }
            },
            "poolId": {
              "type": "keyword"
            },
            "packageId": {
              "type": "keyword"
            },
            "countries": {
              "type": "keyword"
            },
            "vId": {
              "type": "keyword"
            }
          }
        }
      }
    }

Then add data:

    // add vsim
    PUT /test/doc/doc1
    {"counting":6, "join": {"name": "vsim"}, "content": "1", "status": "disabled"}

    PUT /test/doc/doc2
    {"counting":5,"join": {"name": "vsim"}, "content": "2", "status": "disabled"}

    PUT /test/doc/doc3
    {"counting":5,"join": {"name": "vsim"}, "content": "2", "status": "enabled"}

    // add package
    PUT /test/doc/ner2?routing=doc2
    {"join": {"name": "package", "parent": "doc2"}, "countries":["CN", "UK"]}

    PUT /test/doc/ner12?routing=doc1
    {"join": {"name": "package", "parent": "doc1"}, "countries":["CN", "US"]}

    PUT /test/doc/ner11?routing=doc1
    {"join":{"name": "package", "parent": "doc1"}, "countries":["US", "KR"]}

    PUT /test/doc/ner13?routing=doc3
    {"join":{"name": "package", "parent": "doc3"}, "countries":["UK", "AU"]}


    // add pool
    PUT /test/doc/ner21?routing=doc1
    {"join": {"name": "pool", "parent": "doc1"}, "poolId": "MER"}

    PUT /test/doc/ner22?routing=doc2
    {"join": {"name": "pool", "parent": "doc2"}, "poolId": "MER"}

    PUT /test/doc/ner23?routing=doc2
    {"join": {"name": "pool", "parent": "doc2"}, "poolId": "NER"}

I then want to aggregate the counting field, grouped by status (vsim), poolId (pool), and countries (package). The expected result looks like:

disabled-MER-CN: 3

disabled-MER-US: 3

enabled-MR-CN: 1
... and so on.
I'm new to Elasticsearch, and I have read documentation such as

https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html
and
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html

but I still have no idea how to implement this aggregation query. Please give me some suggestions, thanks!

wallellen · January 3, 2019, 1:06pm

I need some suggestions or references, PLEASE

Christian_Dahlqvist · January 3, 2019, 1:14pm

Why have you opted for parent-child here rather than e.g. denormalising the data? Are you using it for the right reason?

Users often jump at parent-child in an attempt to emulate a relational data model, and, as outlined in the documentation, this is generally not the best way to go about working with Elasticsearch:

The join field shouldn’t be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.
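To illustrate what de-normalizing could mean for the sample data in the first post, here is a minimal Python sketch, not a definitive implementation. It folds each vsim's pools and packages into one flat document and then groups in plain Python. The helper names `denormalize` and `group_counting` are hypothetical, and two assumptions are made that the thread does not confirm: that the wanted metric is the sum of `counting`, and that a country listed by several packages of the same vsim counts once.

```python
from collections import defaultdict

# Sample data copied from the first post (parents, then the two child types).
vsims = {
    "doc1": {"counting": 6, "status": "disabled"},
    "doc2": {"counting": 5, "status": "disabled"},
    "doc3": {"counting": 5, "status": "enabled"},
}
packages = [
    {"parent": "doc2", "countries": ["CN", "UK"]},
    {"parent": "doc1", "countries": ["CN", "US"]},
    {"parent": "doc1", "countries": ["US", "KR"]},
    {"parent": "doc3", "countries": ["UK", "AU"]},
]
pools = [
    {"parent": "doc1", "poolId": "MER"},
    {"parent": "doc2", "poolId": "MER"},
    {"parent": "doc2", "poolId": "NER"},
]


def denormalize(vsims, packages, pools):
    """Build one flat document per vsim (hypothetical helper)."""
    countries = defaultdict(set)
    for package in packages:
        # Assumption: duplicate countries across packages collapse into one.
        countries[package["parent"]].update(package["countries"])
    pool_ids = defaultdict(set)
    for pool in pools:
        pool_ids[pool["parent"]].add(pool["poolId"])
    return [
        {
            "vId": vsim_id,
            "status": vsim["status"],
            "counting": vsim["counting"],
            "poolId": sorted(pool_ids[vsim_id]),
            "countries": sorted(countries[vsim_id]),
        }
        for vsim_id, vsim in vsims.items()
    ]


def group_counting(flat_docs):
    """Sum `counting` per (status, poolId, country) combination."""
    totals = defaultdict(int)
    for doc in flat_docs:
        for pool_id in doc["poolId"]:
            for country in doc["countries"]:
                totals[(doc["status"], pool_id, country)] += doc["counting"]
    return dict(totals)


flat_docs = denormalize(vsims, packages, pools)
totals = group_counting(flat_docs)
for (status, pool_id, country), total in sorted(totals.items()):
    print(f"{status}-{pool_id}-{country}: {total}")
```

Under these assumptions doc3 contributes nothing because it has no pool, and disabled-MER-CN sums to 11 (6 from doc1 plus 5 from doc2). The trade-off is that any change to a pool or package means re-indexing the whole flat vsim document.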

wallellen · January 4, 2019, 4:05am

Before this parent-child join, we used a nested structure like the following:

vsim (1) <--> (n) package
vsim (1) <--> (n) pool

In ES, we have about 300,000 vsim and 600,000. The vsim or package changes frequently. When we tried to update fields in the package, we found the updates were very slow: maybe 20 s for 2000 changes.

So we tried the parent-child join. It solves the update performance problem but creates a new one: the aggregation is too difficult to build.

dadoonet · January 4, 2019, 6:44am

Why was that slow? Were you using the update API? Or the single-document index API? Or the bulk API?

That's the price to pay with parent/child. And also memory usage.

wallellen · January 4, 2019, 6:50am

First, the ES cluster is not maintained by us, so we cannot change the default index settings or tune ES settings such as the refresh interval.

So all we can do is use the REST API to update the documents. We tried using a stored script with params to update the package inside the vsim, but the performance was not good.
We also tried keeping all the vsim and package data, in nested form, in a cache; when a package is updated, we flush the changed vsim and package to ES with the bulk API. The bulk size is 700.

PS:
1. The package and vsim change very frequently: sometimes 15 per second, sometimes 1000 per second, mostly around 300 per second.
2. We need to preserve the order of the data changes and cannot lose data, so we have to call the bulk API synchronously. If a bulk request has any failure, we need to retry until the changed data is written to ES.
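The flush described above (batches of 700, sent one at a time, retried on failure) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `send_bulk` is a hypothetical callable standing in for whatever client actually issues the bulk request, it is assumed to return `True` only when every item in the batch succeeded, and retrying a whole batch is assumed to be safe because each item is a full-document index operation with an explicit id. The `max_attempts` limit is also an assumption; the post describes retrying until the data is written.

```python
import time


def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def flush_in_order(changes, send_bulk, batch_size=700,
                   max_attempts=5, backoff_seconds=0.5):
    """Send changes batch by batch, never starting a batch before the
    previous one has succeeded, so the order of changes is preserved.

    `send_bulk(batch)` is a hypothetical callable that returns True when
    the whole batch was accepted and False otherwise.
    Returns the number of batches that were sent successfully.
    """
    sent = 0
    for batch in chunks(changes, batch_size):
        for attempt in range(1, max_attempts + 1):
            if send_bulk(batch):
                sent += 1
                break
            # Simple linear backoff before retrying the same batch.
            time.sleep(backoff_seconds * attempt)
        else:
            # All attempts failed: stop instead of skipping ahead,
            # otherwise later changes would overtake earlier ones.
            raise RuntimeError("bulk batch failed after %d attempts" % max_attempts)
    return sent
```

Stopping at the first batch that cannot be delivered is deliberate: moving on to later batches would break the ordering requirement.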

dadoonet · January 4, 2019, 8:48am

What do you mean? Where is this running?

I'm not sure I understood what vsim is, but I understood this part:

the package and vsim changed very frequently, sometimes 15 per second, sometimes 1000 per second, mostly changed size keeps at 300 per second.

Yeah. So a lot of updates have a cost on a cluster whether you are using parent/child or not. Of course, with p/c you can probably reduce the load at index/update time. But as you know, it comes with drawbacks.

I'm not sure that there is an ideal solution. It's more a question of compromise.
I mean that I'd use parent/child if and only if I had no other choice and had tried everything I could with data denormalization.

Christian_Dahlqvist · January 4, 2019, 8:55am

How frequently are you updating individual documents within the model? If you are updating a document that has not yet been written to a segment, this will trigger a refresh in recent versions which can have a significant impact on performance.

wallellen · January 4, 2019, 9:36am

The ES cluster is maintained by others in our company, and we shouldn't change the default ES settings.

I will try the nested approach with this, and maybe change some settings on this index for updating.

Thanks very much.

Christian_Dahlqvist · January 4, 2019, 10:01am

Nested documents may be even worse for performance, especially if you have frequent updates.

wallellen · January 4, 2019, 10:26am

Thanks very much! Maybe flattening the data is the only choice for me.
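If the data were flattened so that each vsim document carried its own `status`, `counting`, `poolId` and `countries` values, the grouping from the first post might be expressed as three nested `terms` aggregations with a `sum` at the bottom. The Python below is a hedged sketch that only builds the request body and reads back a response of the usual bucket shape. It assumes the flat layout just described, assumes the wanted metric is the sum of `counting`, and uses made-up aggregation names (`by_status`, `by_pool`, `by_country`, `total_counting`) and helper names.

```python
import json


def build_group_by_request(sum_field="counting"):
    """Build a search body grouping by status, poolId and countries.

    Field names come from the mapping in the first post; the aggregation
    names are arbitrary labels chosen for this sketch.
    """
    return {
        "size": 0,  # only aggregations are wanted, no hits
        "aggs": {
            "by_status": {
                "terms": {"field": "status"},
                "aggs": {
                    "by_pool": {
                        "terms": {"field": "poolId"},
                        "aggs": {
                            "by_country": {
                                "terms": {"field": "countries"},
                                "aggs": {
                                    "total_counting": {"sum": {"field": sum_field}}
                                },
                            }
                        },
                    }
                },
            }
        },
    }


def flatten_buckets(response):
    """Turn the nested buckets into 'status-poolId-country' -> total."""
    result = {}
    for status in response["aggregations"]["by_status"]["buckets"]:
        for pool in status["by_pool"]["buckets"]:
            for country in pool["by_country"]["buckets"]:
                key = "-".join([status["key"], pool["key"], country["key"]])
                result[key] = country["total_counting"]["value"]
    return result


print(json.dumps(build_group_by_request(), indent=2))
```

A `terms` aggregation returns only the top 10 buckets by default, so a real query would likely need an explicit `size` on each level.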

system · February 1, 2019, 10:26am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
