Shards needed in parent-child indexing

Hi all,
I use Elasticsearch 6.6 and I have a question about the number of shards I need. I plan to index CSV log files that have thousands of rows each. Each CSV also has a header (the first 8 lines, each prefixed with '#'); see the image below.

image

So, I thought I would use the parent-child approach, as below:

  • Create the index:
    PUT /npk_log
    {
      "settings": {
        "index": {
          "number_of_shards": 3,
          "number_of_replicas": 2
        }
      }
    }

  • Create the mapping:
    PUT npk_log/_mapping/doc
    {
      "dynamic": true,
      "properties": {
        "join_field": {
          "type": "join",
          "relations": {
            "header": "body"
          }
        },
        "latitude": {
          "type": "float",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "longitude": {
          "type": "float",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }

My question is: how should I handle the shards that the parent and child documents are stored in? Is the way I created my index right? I set the routing as shown below; is that correct? I read this page, but I am confused: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-parent-child.html

  • index a parent
    PUT /npk_log/doc/1
    {
      "SystemId": "00:04:4b:df:35:96",
      "Mode": "NORMAL",
      "Operation": "Manual Fertilizer Application",
      "join_field": "header"
    }

  • index 3 children
    POST /npk_log/_bulk
    { "index": { "_type": "doc", "_id": "2", "routing": "1" } }
    { "latitude": 39.12366, "longitude": 23.4566, "join_field": { "name": "body", "parent": "1" } }
    { "index": { "_type": "doc", "_id": "3", "routing": "1" } }
    { "latitude": 36.29867, "longitude": 23.213, "join_field": { "name": "body", "parent": "1" } }
    { "index": { "_type": "doc", "_id": "4", "routing": "1" } }
    { "latitude": 39.298671, "longitude": 23.223, "join_field": { "name": "body", "parent": "1" } }

Why not index only the child documents and duplicate every field from the parent doc in each child document?

So no parent/child in that case.
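
A minimal sketch of that flattening step, in Python. It assumes each header line looks like `# Key: Value` and that the body starts with a row of column names; the real file layout is not shown in this thread, so `parse_log` and the sample text are hypothetical.

```python
import csv
import io


def parse_log(text):
    """Return one flat dict per body row, with the header fields copied in."""
    header = {}
    body_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Assumed header format: "# Key: Value" (split on the first colon only)
            key, _, value = line.lstrip("#").partition(":")
            header[key.strip()] = value.strip()
        elif line.strip():
            body_lines.append(line)
    rows = csv.DictReader(io.StringIO("\n".join(body_lines)))
    # Header fields are repeated in every document, so no join is needed at search time.
    return [{**header, **row} for row in rows]


sample = """# SystemId: 00:04:4b:df:35:96
# Mode: NORMAL
latitude,longitude
39.12366,23.4566
36.29867,23.213
"""

docs = parse_log(sample)
print(docs[0])
```

Each resulting dict could then be sent as one document, for example through the bulk API. The values stay as strings here; converting them to numbers is left out of the sketch.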

Hi @dadoonet, and thanks for the quick reply!
I have already tried that and it worked! I merged the header of the raw CSV files into the body as extra columns so that it could be indexed into Elasticsearch without trouble. But with the large log files that will arrive soon, this will cause serious data redundancy, which would definitely increase the disk space needed on my machine. This is a small sample of the real CSV, which will have many more fields in both the header and the body.
So, I need the parent-child approach.

So you prefer consuming much more memory and being slower, in order to save some disk space?

@dadoonet Isn't it true that with the parent-child model I will need less memory? The header fields are stored only once in the index for each CSV log, right? I am a bit confused now. What do you suggest?

Not really. As elasticsearch has to perform "joins" at search time, it will load all the document ids in the heap.

As I wrote before, I would use a data structure like:

PUT /npk/_doc/1
{
  "location": {
     "lat": 39.12366, 
     "lon": 23.4566
  }, 
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application"
}
PUT /npk/_doc/2
{
  "location": {
     "lat": 36.29867, 
     "lon": 23.213
  }, 
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application"
}
PUT /npk/_doc/3
{
  "location": {
     "lat": 39.298671, 
     "lon": 23.223
  }, 
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application"
}

@dadoonet You say that:

As elasticsearch has to perform "joins" at search time, it will load all the document ids in the heap.

Two basic questions on this:
a) Does Elasticsearch load the document ids together with the whole row, or just the columns that I need in my search?
b) Does Elasticsearch load all the document ids at once, or gradually? If it loads them all at once, it will cause a heap memory failure, right?

I "think" it loads all the ids but experts in this area could confirm.

I'd read:

Thank you @dadoonet, I will study these links.
Back to the primary question of the topic: how many shards should I have?
And should I use a different "_routing" for each header and its child rows, or can I store all the logs in the same shard?

What would be the total size of your index in GB?
If possible, yes: it's better at read time to have a single shard.
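
On the routing part of the question: Elasticsearch 6.x picks the shard from the routing value, roughly `shard = hash(_routing) % number_of_primary_shards`, so a child indexed with its parent's id as routing ends up on the parent's shard. The Python sketch below only illustrates that arithmetic. `zlib.crc32` is a stand-in for the Murmur3 hash that Elasticsearch actually uses, so the shard numbers will not match a real cluster, and `pick_shard` is a hypothetical helper.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # same value as number_of_shards in the index settings


def pick_shard(routing, num_shards=NUM_PRIMARY_SHARDS):
    # Stand-in hash: Elasticsearch uses Murmur3, not CRC32.
    return zlib.crc32(str(routing).encode("utf-8")) % num_shards


parent_shard = pick_shard("1")  # the parent's id doubles as its routing value
# Three children, each indexed with the parent's id as routing.
child_shards = [pick_shard("1") for _child in range(3)]

# Every child routed with the parent's id lands on the parent's shard.
print(parent_shard, child_shards)
```

Different logs can use different routing values (one per header document), so the logs spread across shards while each header stays on the same shard as its own rows.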

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right
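
As a rough starting point before benchmarking, the primary shard count can be estimated from the expected index size and a target shard size. The Python sketch below assumes a target of 30 GB per shard, which is only a commonly quoted rule of thumb and not a figure from this thread; `estimate_primary_shards` is a hypothetical helper.

```python
import math


def estimate_primary_shards(total_size_gb, target_shard_gb=30):
    """Rough primary shard count; validate it with real benchmarks."""
    return max(1, math.ceil(total_size_gb / target_shard_gb))


small = estimate_primary_shards(20)    # a small index fits in a single shard
large = estimate_primary_shards(2000)  # about 2 TB of data
print(small, large)
```

The estimate is no substitute for measuring with your own data and queries, which is what the resources above cover.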


As this index is meant to collect logs from our customers' IoT machines on a continuous basis, it will definitely be large, on the order of terabytes.

So definitely look at the resources I shared.


Thank you @dadoonet. I'll study the material.

@dadoonet Do you think the nested approach would fit my case best?
I ask because the actual difference between parent-child and nested fields is not clear to me.

nested could be useful when your documents contain an array of objects.
That does not seem to be the case here.
