Ingestion from ES-Spark to Ingestion node

snambu5878 · September 8, 2020, 2:59am

Hi Everyone,

We have 12 node cluster, 1 ingestion, 3 master, 3 hot data, 2 warm data, 2 kibana (same for coordinating) nodes and 1 monitoring node.

We are trying to ingestion data from Hadoop environment using ES-Spark plugin into elasticsearch. We need to ingest last 5 years of data into ES out of which 5th year of data should be ingested into Hot node, and 1-4rth year of data into warm nodes.

Don't see a way in the documentation to achieve this. Right now, when we run spark-submit against ingestion node - the entire data is loaded into both hot and warm nodes which we don't want. Is there any way we can mention that the data should be ingested into hot nodes vs warm nodes.

We would like to run this process daily as the delta of the data is really huge.

Any suggestions would be greatly appreciated.

Thank you very much.
Sri

warkolm · September 8, 2020, 3:00am

You should look at using ILM. If you can't use that, what you want is allocation filtering.

Christian_Dahlqvist · September 8, 2020, 3:29am

What is the specification of the hot and warm nodes respectively? Are you using ingest pipelines? When you process deltas will that be new data only or may it include changes to older data? How large is the data set expected to be on disk once indexed (primary shard size only)?

snambu5878 · September 8, 2020, 1:19pm

Thank you Warkolm and Chris for the response.

Here are the details:

Hot nodes (per node): 40 CPU, 196GB RAM, 2TB storage (SSD)
Warm nodes (per node): 40 CPU, 196GB RAM, 17 TB storage (HDD)

We are not using ingest pipelines as we don't have a requirement to preprocess documents yet.
Can we use ingest pipelines for this? Couldn't documentation for this.

Data processing: It will be always new data.

Data size: Total data set size is 5TB. Primary shard size as of now set to 30GB.

Thanks,
Sri

snambu5878 · September 9, 2020, 4:46pm

Hi everyone, any quick feedback would be very mush appreciated.

Thanks,
Sri

Christian_Dahlqvist · September 9, 2020, 4:53pm

The whole point of a hot/warm architecture is that the hot nodes handle all indexing as they have much more performant storage in the form of local SSDs and indexing generally is very I/O intensive. Indexing into the warm nodes with HDDs is likely going to be a lot slower.

Indexing the majority of data (4 years) directly into 2 warm nodes while the hot nodes only handle a minority of data does therefore not make any sense to me. I would recommend processing data in chronological order and perform all indexing on the more powerful hot nodes. Once indices roll over or are no longer indexed into you can relocate these to the warm nodes.

Indexing also uses a good amount of heap, so if the warm nodes also perform indexing you may run into heap problems before you can fill up the nodes.

snambu5878 · September 11, 2020, 8:29am

Thank you for your response Chris. I missed one point here where the total data size is 5 TB (1TB per year). The initial idea of ingesting 4 yrs into warm and 1 year of data into hot is provided by Elastic Architecture team due to the fact that 1 year of data is most frequently searched, analyzed, need low SLA on Kibana(for dashboarding) whereas for 4 years, higher SLA for analysis, search is also fine.

Also the warm nodes JVM heap is set to 31.5GB will it be not enough?

Please let me know.

Thanks,
Sri

Christian_Dahlqvist · September 11, 2020, 8:36am

I agree you want to hold the oldest four years on the warm nodes and the last year on the hot node, but that does not mean you want to index directly into warm nodes. Index one year at a time and use the hot nodes for this. Start with the oldest data and move it to the warm node once completed. Then do the next year. Repeat until all data is ingested and the data will be distributed as recommended.

I think heap should be a bit smaller in order to benefit from compressed pointers but am not sure what the exact limit is.

system · October 9, 2020, 8:36am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Load unfairly distributed during large ingest Elasticsearch	6	631	May 8, 2017
ES Ingestion Performance Issue Elasticsearch	2	302	March 15, 2019
ES Cloud - Running ingest pipelines on warm nodes? Elasticsearch ingest-pipeline	41	1653	February 4, 2021
Elasticsearch Cluster Ingest Node Sizing Elasticsearch ingest-pipeline	4	1397	April 6, 2022
Sizing an ingest node Elasticsearch	1	388	July 5, 2017

Ingestion from ES-Spark to Ingestion node

Related topics