Logstash or Spark?

Hello,
I have a project where I have to send millions of entries from different MySQL databases to Elasticsearch (on an Elastic Cloud instance). I'll have to make transformations to these data before sending them to an Elasticsearch index. Considering the volume of data, would you recommend Logstash, Spark, or maybe another tool for these two actions (ingest and transform)? Around 100 000 updates will be sent per day.
Thanks very much in advance for your recommendations and advice.

I haven't used Spark, to be honest.
But with Logstash you can do a lot: there is already a JDBC input, you can enrich the data as you like, then ingest it into Elastic Cloud, all at the scale you want.

Hello Yassine, thank you for your answer!

The 1st question to ask is a conceptual one:

I'll have to make transformations to these data

Are these transformations local transformations on a single data point? Examples of such operations are enriching an IP address with a location or normalizing a value, e.g. currency conversion. This is a so-called map operation. Map operations can be done easily with Logstash or ingest pipelines.
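
To make the idea concrete, here is a minimal Python sketch of a map operation. It is only an illustration of the concept, not Logstash or ingest pipeline syntax. The field names and the exchange rates are made up for the example.

```python
# Hypothetical exchange rates to EUR (made-up values for illustration only).
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.5, "GBP": 2.0}


def transform(row):
    """Map step: build one output document from one input row.

    The function looks at a single row only, so rows can be
    processed independently and in any order.
    """
    return {
        "id": row["id"],
        # Normalize a value: trim whitespace and upper-case the country code.
        "country": row["country"].strip().upper(),
        # Convert the amount to a common currency.
        "amount_eur": round(row["amount"] * RATES_TO_EUR[row["currency"]], 2),
    }


# Hypothetical rows, shaped like the result of a SQL query.
rows = [
    {"id": 1, "country": " fr ", "amount": 10.0, "currency": "USD"},
    {"id": 2, "country": "de", "amount": 3.0, "currency": "GBP"},
]

docs = [transform(r) for r in rows]
```

Because each document depends on exactly one row, this kind of work fits a per-event pipeline well.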

Do you require transformations based on multiple data points by summarizing them? For example, an average over a set of data points, a cardinality over values, or a pivot around an entity. This requires a reduce operation. In Elasticsearch, a reduce is provided by aggregations, while transforms allow you to write aggregation results back into an index.
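
As a rough Python sketch of what a reduce looks like (again only the concept, not the aggregation or transform API), the function below pivots rows around an entity and computes a count, an average and a cardinality per entity. The field names `customer`, `amount` and `product` are assumptions made for the example.

```python
from collections import defaultdict


def pivot_by_customer(rows):
    """Reduce step: summarize many rows into one summary per customer."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["customer"]].append(row)

    summary = {}
    for customer, items in groups.items():
        amounts = [r["amount"] for r in items]
        summary[customer] = {
            "order_count": len(items),
            # Average over a set of data points.
            "avg_amount": sum(amounts) / len(amounts),
            # Cardinality: number of distinct values.
            "distinct_products": len({r["product"] for r in items}),
        }
    return summary


# Hypothetical input rows.
rows = [
    {"customer": "alice", "amount": 10.0, "product": "book"},
    {"customer": "alice", "amount": 30.0, "product": "pen"},
    {"customer": "alice", "amount": 20.0, "product": "book"},
    {"customer": "bob", "amount": 5.0, "product": "pen"},
]

summary = pivot_by_customer(rows)
```

Unlike the map case, the result for one customer cannot be computed without seeing all rows for that customer, which is why a per-event pipeline alone is not enough here.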

Spark is a MapReduce-style framework and provides both map and reduce.

So your question can't be answered without knowing what type of transformations you are looking for.

The next question: How sophisticated are your transformations? Spark allows both simple transformations, e.g. using Spark SQL, and custom code that lets you do almost anything.

Elasticsearch tooling (Logstash/ingest/aggregations/transform) comes with a set of common data transformations, too. If the out-of-the-box tooling is not sufficient, you can customize it using scripts. The level of customization is, however, not as open as in Spark, where you can integrate any third-party library. This is not a big limitation; I guess 99% of data work does not require custom code.

This brings us to non-functional requirements:

Around 100 000 updates will be sent per day.

Both Spark and the Elastic Stack can do this easily. It's only a question of resources. This is where it gets interesting. If you want to store the data in Elasticsearch anyway, it is preferable to use as few tools as possible. If your data transformations do not require complex custom code, then I recommend staying in the stack.

The other aspect of it: data locality. Besides the data transformation itself, it is always costly to move data around. By its nature, an external data transformation requires more data to be moved than something that runs internally. E.g. if you want to do a pivot on data, a Spark solution has to dump all the data, ship it to Spark, parse it, create the result and ship it back to the stack. Within the stack, a significant part of an aggregation runs where the data sits, on the data node that holds the data. Only a data summary gets sent between nodes, and data serialization and parsing are much cheaper. Not to forget that this is all done within one availability zone in the cloud. Transferring this to Spark: anything that can be done with Spark SQL should be preferred over custom code that does a similar thing, because Spark SQL is closer to the data and therefore optimized.
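
The "only a data summary gets sent" point can be illustrated with a small Python sketch. This is a simplified model for illustration, not how Elasticsearch is implemented internally: each list stands in for the values held on one data node, each node computes a small partial result, and only those partials are merged.

```python
def local_partial(values):
    """Runs where the data sits: reduce a node's values to (sum, count)."""
    return (sum(values), len(values))


def merge_partials(partials):
    """Runs on the coordinating side: combine the small per-node summaries."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Hypothetical data, split across three nodes.
node_values = [
    [10.0, 20.0],
    [30.0],
    [40.0, 50.0, 60.0],
]

# One (sum, count) pair per node crosses the network, not the raw values.
partials = [local_partial(v) for v in node_values]
average = merge_partials(partials)
```

With three nodes, three small tuples are moved regardless of how many values each node holds. An external tool would instead need every raw value shipped to it before it could compute the same average.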

Long story short: from both a maintenance perspective (fewer tools) and a performance perspective (data locality), it is preferable to do transformations within the stack. If the stack can't provide the functionality, its openness allows you to integrate other tools.

Thank you very much for your very detailed answer! I will opt to stay in the ELK stack, combining ingest pipelines and aggregations.
