Shards needed in parent-child indexing

johnkary · January 13, 2020, 4:08pm

Hi all,
I use Elasticsearch 6.6 and I have a question about the number of shards that I need. I plan to index CSV log files that have thousands of rows each. Each CSV also has a header (the first 8 lines, each prefixed with '#'); see the image below.

So I thought I would use the parent-child approach, as below.

Create the index:
PUT /npk_log
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  }
}
Create the mapping:
PUT npk_log/_mapping/doc
{
  "dynamic": true,
  "properties": {
    "join_field": {
      "type": "join",
      "relations": {
        "header": "body"
      }
    },
    "latitude": {
      "type": "float",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "longitude": {
      "type": "float",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

My question is: how should I handle the shards that the parent and child documents are stored in? Is the way I created my index correct? I use the _routing field as shown below; is that the right way to do it? I read the following page, but I am confused: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-parent-child.html

Index a parent:
PUT /npk_log/doc/1
{
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application",
  "join_field": "header"
}
Index 3 children:
POST /npk_log/_bulk
{ "index": { "_type": "doc", "_id": "2", "routing": "1" } }
{ "latitude": 39.12366, "longitude": 23.4566, "join_field": { "name": "body", "parent": "1" } }
{ "index": { "_type": "doc", "_id": "3", "routing": "1" } }
{ "latitude": 36.29867, "longitude": 23.213, "join_field": { "name": "body", "parent": "1" } }
{ "index": { "_type": "doc", "_id": "4", "routing": "1" } }
{ "latitude": 39.298671, "longitude": 23.223, "join_field": { "name": "body", "parent": "1" } }

dadoonet · January 13, 2020, 4:49pm

Why not index only the child documents and duplicate every field from the parent doc in each child document?

So no parent/child in that case.
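
A minimal sketch of that flattening in Python, assuming each header line looks like `# Key: Value` and the body is a regular CSV with its own column-name row. The real file layout is not shown in this thread, so `parse_log` and the sample text are hypothetical:

```python
import csv
import io


def parse_log(text):
    """Return one flat dict per body row, with the header fields copied into each.

    Assumes header lines start with '#' and hold 'Key: Value' pairs
    (a hypothetical layout; adapt the parsing to the real files).
    """
    header = {}
    body_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.lstrip("#").partition(":")
            header[key.strip()] = value.strip()
        elif line.strip():
            body_lines.append(line)
    rows = csv.DictReader(io.StringIO("\n".join(body_lines)))
    # Header fields first, then the row's own columns.
    return [{**header, **row} for row in rows]


sample = """# SystemId: 00:04:4b:df:35:96
# Mode: NORMAL
latitude,longitude
39.12366,23.4566
36.29867,23.213
"""

docs = parse_log(sample)
print(docs[0])
```

Every value comes out as a string here; converting latitude and longitude to numbers before indexing is left out to keep the sketch short.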

johnkary · January 14, 2020, 8:53am

Hi @dadoonet and thanks for the quick reply!
I have already tried that and it worked! The header of the raw CSV files was merged into the body as extra columns so that it could be indexed into Elasticsearch without trouble. But with the large log files that will arrive soon, this will cause serious redundancy, which would definitely increase the disk space needed on my machine. This is a small sample of the real CSV, which will have many more fields in both the header and the body.
So, I need the parent-child approach.

dadoonet · January 14, 2020, 11:18am

So you prefer consuming much more memory and being slower, just to save some disk space?

johnkary · January 14, 2020, 11:37am

@dadoonet Isn't it true that with the parent-child model I will need less memory? The header fields are stored only once in the index for each CSV log, right? I am a bit confused now. What do you suggest?

dadoonet · January 14, 2020, 12:39pm

Not really. As elasticsearch has to perform "joins" at search time, it will load all the document ids in the heap.

As I wrote before, a data structure like:

PUT /npk/_doc/1
{
  "location": {
     "lat": 39.12366, 
     "lon": 23.4566
  }, 
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application"
}
PUT /npk/_doc/2
{
  "location": {
     "lat": 36.29867, 
     "lon": 23.213
  }, 
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application"
}
PUT /npk/_doc/3
{
  "location": {
     "lat": 39.298671, 
     "lon": 23.223
  }, 
  "SystemId": "00:04:4b:df:35:96",
  "Mode": "NORMAL",
  "Operation": "Manual Fertilizer Application"
}
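
To load many such documents at once, the same structure can be sent through the `_bulk` API. A rough sketch that only builds the newline-delimited request body; `bulk_body` and the sample rows are illustrative, and actually sending the payload to the cluster is left out:

```python
import json


def bulk_body(index, header, rows, start_id=1):
    """Build an NDJSON _bulk payload: one action line plus one source line per row."""
    lines = []
    for offset, (lat, lon) in enumerate(rows):
        action = {"index": {"_index": index, "_type": "_doc", "_id": str(start_id + offset)}}
        # Every document repeats the header fields next to its own location.
        source = {"location": {"lat": lat, "lon": lon}, **header}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the last line to end with a newline.
    return "\n".join(lines) + "\n"


header = {
    "SystemId": "00:04:4b:df:35:96",
    "Mode": "NORMAL",
    "Operation": "Manual Fertilizer Application",
}
body = bulk_body("npk", header, [(39.12366, 23.4566), (36.29867, 23.213)])
print(body)
```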

johnkary · January 14, 2020, 1:25pm

@dadoonet You say that:

As elasticsearch has to perform "joins" at search time, it will load all the document ids in the heap.

Two basic questions on this:
a) Does Elasticsearch load the document ids together with the whole row, or just the columns that I need in my search?
b) Does Elasticsearch load all the document ids at once or gradually? If it loads them all at once, that will cause a heap memory failure, right?

dadoonet · January 14, 2020, 2:18pm

I "think" it loads all the ids but experts in this area could confirm.

I'd read:

johnkary · January 14, 2020, 3:22pm

Thank you @dadoonet, I will study these links.
Back to the primary question of the topic: how many shards should I have?
And should I use a different "_routing" for each header-child relationship, or can I store all the logs in the same shard?

dadoonet · January 15, 2020, 6:03am

What would be the total size of your index in GB?
If possible, yes: it's better at read time to have one single shard.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right
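
On the routing part of the question: Elasticsearch picks the shard from the routing value, documented for 6.x as `shard_num = hash(_routing) % num_primary_shards`. A child therefore only has to carry its parent's id as routing to land on the parent's shard, while different logs can still spread over several shards. A rough illustration, where the document ids are made up and `zlib.crc32` is only a stand-in for Elasticsearch's internal Murmur3-based hash:

```python
import zlib


def shard_for(routing, num_primary_shards):
    """Pick a shard as hash(routing) % num_primary_shards.

    zlib.crc32 is a stand-in hash for illustration only; Elasticsearch uses
    its own Murmur3-based hash, so the real shard numbers will differ.
    """
    return zlib.crc32(str(routing).encode("utf-8")) % num_primary_shards


NUM_SHARDS = 3

# (document id, routing value): each child is routed by its parent's id.
docs = [
    ("header-1", "header-1"),
    ("row-1", "header-1"),
    ("row-2", "header-1"),
    ("header-2", "header-2"),
    ("row-3", "header-2"),
]

placement = {doc_id: shard_for(routing, NUM_SHARDS) for doc_id, routing in docs}

for doc_id, shard in placement.items():
    print(doc_id, "-> shard", shard)
```

Because the shard depends only on the routing value and the number of primary shards, that number cannot be changed later without reindexing, which is one more reason to size the index up front.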

johnkary · January 15, 2020, 2:34pm

As this index is planned to collect logs from our customers' IoT machines on a continuous basis, it will definitely be large, on the order of terabytes.

dadoonet · January 15, 2020, 2:52pm

So definitely look at the resources I shared.

johnkary · January 16, 2020, 10:47am

Thank you @dadoonet. I'll study the material.

johnkary · January 17, 2020, 9:53am

@dadoonet Do you think the nested approach would fit my case best?
I ask because it's not clear to me what the actual difference is between parent-child and nested fields.

dadoonet · January 17, 2020, 10:19am

nested could be useful when your documents contain an array of objects.
That does not seem to be the case here.

system · February 14, 2020, 10:19am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
