Ingestion from ES-Spark to Ingestion node

Hi Everyone,

We have 12 node cluster, 1 ingestion, 3 master, 3 hot data, 2 warm data, 2 kibana (same for coordinating) nodes and 1 monitoring node.

We are trying to ingestion data from Hadoop environment using ES-Spark plugin into elasticsearch. We need to ingest last 5 years of data into ES out of which 5th year of data should be ingested into Hot node, and 1-4rth year of data into warm nodes.

Don't see a way in the documentation to achieve this. Right now, when we run spark-submit against ingestion node - the entire data is loaded into both hot and warm nodes which we don't want. Is there any way we can mention that the data should be ingested into hot nodes vs warm nodes.

We would like to run this process daily as the delta of the data is really huge.

Any suggestions would be greatly appreciated.

Thank you very much.
Sri

You should look at using ILM. If you can't use that, what you want is allocation filtering.

What is the specification of the hot and warm nodes respectively? Are you using ingest pipelines? When you process deltas will that be new data only or may it include changes to older data? How large is the data set expected to be on disk once indexed (primary shard size only)?

Thank you Warkolm and Chris for the response.

Here are the details:

Hot nodes (per node): 40 CPU, 196GB RAM, 2TB storage (SSD)
Warm nodes (per node): 40 CPU, 196GB RAM, 17 TB storage (HDD)

We are not using ingest pipelines as we don't have a requirement to preprocess documents yet.
Can we use ingest pipelines for this? Couldn't documentation for this.

Data processing: It will be always new data.

Data size: Total data set size is 5TB. Primary shard size as of now set to 30GB.

Thanks,
Sri

Hi everyone, any quick feedback would be very mush appreciated.

Thanks,
Sri

The whole point of a hot/warm architecture is that the hot nodes handle all indexing as they have much more performant storage in the form of local SSDs and indexing generally is very I/O intensive. Indexing into the warm nodes with HDDs is likely going to be a lot slower.

Indexing the majority of data (4 years) directly into 2 warm nodes while the hot nodes only handle a minority of data does therefore not make any sense to me. I would recommend processing data in chronological order and perform all indexing on the more powerful hot nodes. Once indices roll over or are no longer indexed into you can relocate these to the warm nodes.

Indexing also uses a good amount of heap, so if the warm nodes also perform indexing you may run into heap problems before you can fill up the nodes.

Thank you for your response Chris. I missed one point here where the total data size is 5 TB (1TB per year). The initial idea of ingesting 4 yrs into warm and 1 year of data into hot is provided by Elastic Architecture team due to the fact that 1 year of data is most frequently searched, analyzed, need low SLA on Kibana(for dashboarding) whereas for 4 years, higher SLA for analysis, search is also fine.

Also the warm nodes JVM heap is set to 31.5GB will it be not enough?

Please let me know.

Thanks,
Sri

I agree you want to hold the oldest four years on the warm nodes and the last year on the hot node, but that does not mean you want to index directly into warm nodes. Index one year at a time and use the hot nodes for this. Start with the oldest data and move it to the warm node once completed. Then do the next year. Repeat until all data is ingested and the data will be distributed as recommended.

I think heap should be a bit smaller in order to benefit from compressed pointers but am not sure what the exact limit is.