Is there any solution for aggregations with an Elasticsearch parent-child join?

I have an ES index mapping like the following:

    PUT /test
    {
      "mappings": {
        "doc": {
          "properties": {
            "status": {
              "type": "keyword"
            },
            "counting": {
              "type": "integer"
            },
            "join": {
              "type": "join",
              "relations": {
                "vsim": ["pool", "package"]
              }
            },
            "poolId": {
              "type": "keyword"
            },
            "packageId": {
              "type": "keyword"
            },
            "countries": {
              "type": "keyword"
            },
            "vId": {
              "type": "keyword"
            }
          }
        }
      }
    }

Then add data:

    // add vsim
    PUT /test/doc/doc1
    {"counting":6, "join": {"name": "vsim"}, "content": "1", "status": "disabled"}

    PUT /test/doc/doc2
    {"counting":5,"join": {"name": "vsim"}, "content": "2", "status": "disabled"}

    PUT /test/doc/doc3
    {"counting":5,"join": {"name": "vsim"}, "content": "2", "status": "enabled"}

    // add package
    PUT /test/doc/ner2?routing=doc2
    {"join": {"name": "package", "parent": "doc2"}, "countries":["CN", "UK"]}

    PUT /test/doc/ner12?routing=doc1
    {"join": {"name": "package", "parent": "doc1"}, "countries":["CN", "US"]}

    PUT /test/doc/ner11?routing=doc1
    {"join":{"name": "package", "parent": "doc1"}, "countries":["US", "KR"]}

    PUT /test/doc/ner13?routing=doc3
    {"join":{"name": "package", "parent": "doc3"}, "countries":["UK", "AU"]}


    // add pool
    PUT /test/doc/ner21?routing=doc1
    {"join": {"name": "pool", "parent": "doc1"}, "poolId": "MER"}

    PUT /test/doc/ner22?routing=doc2
    {"join": {"name": "pool", "parent": "doc2"}, "poolId": "MER"}

    PUT /test/doc/ner23?routing=doc2
    {"join": {"name": "pool", "parent": "doc2"}, "poolId": "NER"}

Then I want to aggregate the counting field, grouped by status (vsim), poolId (pool) and countries (package). The expected result looks like:

disabled-MER-CN: 3

disabled-MER-US: 3

enabled-MR-CN: 1
... and so on.
I'm new to Elasticsearch, and I have read documentation such as

https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html
and
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html

but I still have no idea how to implement this aggregation query. Please give me some suggestions, thanks!

I need some suggestions or references, PLEASE
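As a starting point only, here is a minimal sketch of how far the `children` aggregation from the linked documentation gets with the mapping above. It builds two search bodies (to be sent to `/test/_search`): status to poolId, and status to countries. The aggregation names `by_status`, `to_child` and `by_value` and the helper `children_terms` are made up for illustration. It does not produce the three-way status/poolId/countries grouping, and the inner `doc_count` values count child documents rather than summing the parent's `counting` field.

```python
import json


def children_terms(child_type, field):
    """Build a search body: terms on the parent's status, then step into
    one child type with a `children` aggregation, then terms on a child field."""
    return {
        "size": 0,  # only aggregation results are wanted, no hits
        "aggs": {
            "by_status": {
                "terms": {"field": "status"},
                "aggs": {
                    "to_child": {
                        "children": {"type": child_type},
                        "aggs": {"by_value": {"terms": {"field": field}}},
                    }
                },
            }
        },
    }


# One body per child relation defined in the join field.
pool_query = children_terms("pool", "poolId")
package_query = children_terms("package", "countries")

print(json.dumps(pool_query, indent=2))
```

Each body covers one parent-to-child path, so the two results would still have to be combined on the client side.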

Why have you opted for parent-child here rather than e.g. denormalising the data? Are you using it for the right reason?

Users often jump at parent-child in an attempt to emulate a relational data model. As outlined in the documentation, this is generally not the best way to work with Elasticsearch:

The join field shouldn’t be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.

Before this parent-child join, we used a nested structure like the following:

vsim(1) <--> (n) package
vsim(1) <--> (n) pool

In ES, we have about 300,000 vsim and 600,000. The vsims and packages change frequently. When we tried to update fields in a package, we found that updates were very slow: roughly 2000 changes took 20s.

So we tried the parent-child join. We found that it solves the update performance problem but creates a new one: the aggregation is too difficult to build.

Why was that slow? Were you using the update API? Or the single-document index API? Or the bulk API?

That's the price to pay with parent/child. And also memory usage.

First, the ES cluster is not maintained by us, so we cannot change the default index settings or tune ES settings such as the refresh interval.

So all we can do is use the REST API to update the documents. We tried using a stored script with params to update the package inside the vsim, but the performance was not good.
We also tried keeping all the vsims and their nested packages in a cache; when a package is updated, we flush the changed vsim and package to ES with the bulk API. The bulk size is 700.

PS:
1. The packages and vsims change very frequently: sometimes 15 changes per second, sometimes 1000 per second, but mostly around 300 per second.
2. We need to preserve the order of the data changes and cannot lose data, so we use the bulk API synchronously. If a bulk request has any failure, we retry until the changed data has been written to ES.
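The retry requirement in point 2 could be sketched roughly as below. This is only an illustration under assumptions: `send_bulk` is a hypothetical callable that posts one batch and returns the parsed bulk response, the helper names and the attempt limit are made up, and each batch is assumed to contain at most one action per document id (otherwise re-sending only the failed items could reorder writes to the same document). The only Elasticsearch detail relied on is the bulk response shape: a top-level `errors` flag and an `items` list whose entries carry a `status` per action.

```python
import time


def failed_positions(bulk_response):
    """Return the positions of actions that did not succeed in a bulk response."""
    if not bulk_response.get("errors"):
        return []
    failed = []
    for pos, item in enumerate(bulk_response["items"]):
        # Each item is a single-key dict such as {"index": {"status": 201, ...}}.
        (result,) = item.values()
        if result.get("status", 500) >= 300:
            failed.append(pos)
    return failed


def bulk_until_done(actions, send_bulk, max_attempts=5, backoff_seconds=0.0):
    """Send a batch synchronously and re-send only the failed actions,
    keeping their relative order, until none fail or attempts run out.
    Returns the number of attempts used."""
    pending = list(actions)
    for attempt in range(1, max_attempts + 1):
        response = send_bulk(pending)  # hypothetical sender, blocks until ES replies
        failed = failed_positions(response)
        if not failed:
            return attempt
        pending = [pending[pos] for pos in failed]
        time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise RuntimeError(
        f"{len(pending)} actions still failing after {max_attempts} attempts"
    )
```

The next batch would only be sent after `bulk_until_done` returns, which keeps batches in sequence at the cost of throughput.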

What do you mean? Where is this running?

I'm not sure I understood what vsim is, or that I understood this part:

the package and vsim changed very frequently, sometimes 15 per second, sometimes 1000 per second, mostly changed size keeps at 300 per second.

Yeah. So a lot of updates have a cost on a cluster whether you are using parent/child or not. Of course, with p/c you can probably reduce the load at index/update time. But as you know, it comes with drawbacks.

I'm not sure that there is an ideal solution. It's more a question of compromise.
I mean that I'd use parent/child if and only if I had no other choice and had tried everything I could with data denormalization.
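To make the denormalization option concrete, here is a rough sketch using the sample data from the question. It folds each vsim's pools and packages into one flat document with multi-valued `poolIds` and `countries` fields (both field names, and the helper functions, are invented for this example), then groups in plain Python to show what a status/poolId/country grouping over such documents would yield. It assumes that "count the counting" means summing the `counting` field, so the totals differ from the illustrative numbers in the question.

```python
from collections import defaultdict
from itertools import product

# Sample data copied from the question: vsim parents and their children.
vsims = {
    "doc1": {"status": "disabled", "counting": 6},
    "doc2": {"status": "disabled", "counting": 5},
    "doc3": {"status": "enabled", "counting": 5},
}
packages = [
    ("doc2", ["CN", "UK"]),
    ("doc1", ["CN", "US"]),
    ("doc1", ["US", "KR"]),
    ("doc3", ["UK", "AU"]),
]
pools = [("doc1", "MER"), ("doc2", "MER"), ("doc2", "NER")]


def denormalize(vsims, packages, pools):
    """Build one flat document per vsim with multi-valued poolIds/countries."""
    merged = {vid: {"poolIds": set(), "countries": set()} for vid in vsims}
    for parent, countries in packages:
        merged[parent]["countries"].update(countries)
    for parent, pool_id in pools:
        merged[parent]["poolIds"].add(pool_id)
    return {
        vid: {
            **vsims[vid],
            "poolIds": sorted(merged[vid]["poolIds"]),
            "countries": sorted(merged[vid]["countries"]),
        }
        for vid in vsims
    }


def sum_counting(flat_docs):
    """Sum `counting` per (status, poolId, country) combination."""
    totals = defaultdict(int)
    for doc in flat_docs.values():
        for pool_id, country in product(doc["poolIds"], doc["countries"]):
            totals[(doc["status"], pool_id, country)] += doc["counting"]
    return dict(totals)


flat = denormalize(vsims, packages, pools)
totals = sum_counting(flat)
```

With flat documents like these, every change to a package or pool means rewriting the whole vsim document, which is the update cost being traded against simpler aggregations.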

How frequently are you updating individual documents within the model? If you are updating a document that has not yet been written to a segment, this will trigger a refresh in recent versions which can have a significant impact on performance.

The ES cluster is maintained by others in our company, and we shouldn't change the default ES settings.

I will try the nested approach with this in mind, and maybe change some settings on this index for updating.

Thanks very much.

Nested documents may be even worse for performance, especially if you have frequent updates.

Thanks very much! Maybe flattening the data is the only choice for me.
