I use Elasticsearch 6.6 and I have a question about the number of shards I need. I plan to index CSV log files with thousands of rows each. These CSVs also have a header (the first 8 lines, prefixed with '#'), see the image below.
So I thought of using the parent-child approach, as below.
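To make the idea concrete, in Elasticsearch 6.x a parent-child relation is modeled with a `join` field in the mapping. A minimal sketch of what I have in mind (the index and field names here are just illustrative assumptions, not my final mapping):

```json
PUT logs
{
  "mappings": {
    "_doc": {
      "properties": {
        "csv_relation": {
          "type": "join",
          "relations": {
            "header": "row"
          }
        }
      }
    }
  }
}
```

Here `header` would be the parent type (the 8 commented lines of each CSV) and `row` the child type (each data line).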
Hi @dadoonet and thanks for the quick reply!
I have already tried it and it worked! The header of the raw CSV files was merged as extra columns into the body, so it could be indexed into Elasticsearch without trouble. But with the large log files that will come soon, this will potentially cause serious data redundancy, which would definitely increase the disk space needed on my machine. This is a small sample of the real CSV, which will have many more fields in both the header and the body.
So I need the parent-child approach.
@dadoonet Isn't it true that with the parent-child model I will need less storage? The header fields are stored only once in the index for each log CSV, right? I am a bit confused now. What do you suggest?
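To illustrate what I mean (all IDs and field names below are hypothetical): the header would be indexed once as a parent document, and every CSV row would be a small child document that only points back to it, instead of repeating the header columns:

```json
PUT logs/_doc/header-1
{
  "source_file": "sample.csv",
  "header_fields": "...",
  "csv_relation": "header"
}

PUT logs/_doc/row-1?routing=header-1
{
  "row_data": "...",
  "csv_relation": {
    "name": "row",
    "parent": "header-1"
  }
}
```

So the header fields exist once per file, and the thousands of row documents stay lean.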
As Elasticsearch has to perform "joins" at search time, it will load all the document IDs into the heap.
Two basic questions on this:
a) Does Elasticsearch load all the document IDs along with the whole row, or just the columns I need in my search?
b) Does Elasticsearch load all the document IDs at once or gradually? Because if it loads them all at once, it will cause a heap memory failure, right?
Thank you @dadoonet, I will study these links.
Back to the primary question of the topic: how many shards should I have?
And, for each different header-child relationship, should I use a different "_routing", or can I store all the logs in the same shard?
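To sketch the routing question (all names and routing values below are hypothetical): as far as I understand, each child must be indexed with the same routing value as its parent, so that every header-rows pair lands on one shard, e.g. one routing value per CSV file:

```json
PUT logs/_doc/fileA-header?routing=fileA
{
  "csv_relation": "header"
}

PUT logs/_doc/fileA-row-1?routing=fileA
{
  "csv_relation": {
    "name": "row",
    "parent": "fileA-header"
  }
}

PUT logs/_doc/fileB-header?routing=fileB
{
  "csv_relation": "header"
}
```

Is one routing value per file the right approach here, or is a single shared routing value for all logs acceptable?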