Architecture question - ETL

freakwave10 · November 7, 2018, 7:42am

Hi all,

I have the following use case. We are pushing file, directory, attribute and user data into Elasticsearch.
Now we want to do reporting on top of it.
The data needs to be transformed because the initial format does not allow the queries we need.
For example, we have one scripted field that calculates the percentage of a user's files that are video files. The problem is that you cannot get the top 100 users based on this scripted field. So the idea is to take the whole index and create a new index from it in which the scripted field is a "normal" field, so that I can run the queries I need. Finally, we want to display the information with Highcharts.

Since I come from an old-fashioned ETL background, I would like to ask whether this is the right approach, and I have the following questions:

Is it reasonable to perform the data transformation in Elasticsearch by using indexes that are based on other indexes?
Is it better to use MongoDB for data transformations?
Where would you store the final JSON doc that contains the data used to display the information in Highcharts?
Also, considering the transactional issues: the chart should be available all the time, even while the indexes are being recalculated. Is it true that, since the Highcharts data is stored in one single document in Elasticsearch, there will be no issues because Elasticsearch is transactional at the document level?

Thanks a lot for any insight.

Wolfgang

dadoonet · November 7, 2018, 8:15am

Is it reasonable to perform the data transformation in Elasticsearch by using indexes that are based on other indexes?

Why not? It's not that common, but people do reindex existing data to create a new index. Note that it would be even better if you could do the transformation in real time while ingesting the original data. I'm not sure whether you can, though, as it depends on your use case and the data you have.
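
To make the idea concrete, here is a minimal sketch of turning the scripted value into a "normal" field by building one document per user. It is only an illustration: the field names (`user`, `extension`, `file_count`, `video_pct`) and the list of video extensions are assumptions, since the thread does not show the real mapping, and the resulting documents would still have to be indexed with whatever client or pipeline you already use.

```python
from collections import defaultdict

# Assumption: a file counts as "video" based on its extension.
VIDEO_EXTENSIONS = {"mp4", "mkv", "avi", "mov"}


def build_user_docs(file_docs):
    """Aggregate per-file documents into one document per user.

    Each input document is assumed to have a "user" and an "extension" field.
    """
    totals = defaultdict(int)
    videos = defaultdict(int)
    for doc in file_docs:
        user = doc["user"]
        totals[user] += 1
        if doc["extension"].lower() in VIDEO_EXTENSIONS:
            videos[user] += 1
    return [
        {
            "user": user,
            "file_count": totals[user],
            # Stored as a plain numeric field, so it can be sorted on.
            "video_pct": 100.0 * videos[user] / totals[user],
        }
        for user in totals
    ]


def top_users(user_docs, n=100):
    """Return the n users with the highest video percentage."""
    return sorted(user_docs, key=lambda d: d["video_pct"], reverse=True)[:n]


files = [
    {"user": "a", "extension": "mp4"},
    {"user": "a", "extension": "txt"},
    {"user": "b", "extension": "MKV"},
]
user_docs = build_user_docs(files)
```

Once documents like these are in their own index, the "top 100 users" question becomes an ordinary search sorted on `video_pct` with a size of 100, instead of something that has to go through a scripted field.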

Is it better to use MongoDB for data transformations?

I'm biased so I'm not sure I can tell but IMHO: no.

Where would you store the final JSON doc that contains the data used to display the information in Highcharts?

Elasticsearch, I guess. But again, it depends on the use case and what your documents look like.

Also, considering the transactional issues: the chart should be available all the time, even while the indexes are being recalculated. Is it true that, since the Highcharts data is stored in one single document in Elasticsearch, there will be no issues because Elasticsearch is transactional at the document level?

Use aliases. An alias is like a pointer to a concrete index. Reindex the data into a new index and then switch the alias when done.
When querying the data, query the alias and not the index. That should cover your need.
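
As a rough sketch of the alias switch, the snippet below builds the request body for the `_aliases` endpoint, which applies all listed actions atomically, so readers of the alias never see a moment without an index behind it. The alias and index names (`reports`, `reports_v1`, `reports_v2`) are made up for illustration, and sending the body (for example as `POST /_aliases`) is left to whatever HTTP client you use.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build an _aliases request body that moves `alias` from old_index to new_index.

    Both actions are sent in a single request, so the switch happens in one step.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the chart queries "reports", never a concrete index.
body = build_alias_swap("reports", "reports_v1", "reports_v2")
print(json.dumps(body, indent=2))
```

The chart keeps querying `reports` throughout; the old index can be deleted once the switch has gone through.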

