Architecture question - ETL


(Wolfgang Bergbauer) #1

Hi all,

I have the following use case. We are pushing file, directory, attribute and user data into Elasticsearch.
Now we want to do reporting on top of it.
The data needs to be transformed because the initial format does not allow the queries that are necessary.
For example, we have one scripted field that calculates the percentage of video files that a user has. The problem is that you cannot get the top 100 users based on this scripted field. So the idea is to take the whole index and create a new index from all of its data, in which the scripted field is a "normal" field, so that I can run the queries I need. Finally, we want to display the information with Highcharts.

Since I am coming from an old-fashioned ETL approach, I would like to ask whether this is the right approach, and I have the following questions:

  • Does it make sense to perform the data transformation in Elasticsearch, using indexes that are based on other indexes?
  • Is it better to use MongoDB for the data transformations?
  • Where would you store the final JSON doc that contains the data used to display the information in Highcharts?
  • There are also the transactional issues to consider. The chart should be available all the time, even while the indexes are being recalculated. Is it true that, since the Highcharts data is stored in one single document in Elasticsearch, there will be no issues because Elasticsearch is transactional at the document level?

Thanks a lot for any insight.

Wolfgang


(David Pilato) #2
  • Does it make sense to perform the data transformation in Elasticsearch, using indexes that are based on other indexes?

Why not? It is not that common, but people do reindex existing data to create a new index. Note that it would be even better if you could do the transformation in real time, when ingesting the original data. I am not sure you can, though, as it depends on your use case and the data you have.
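As a rough illustration of the ingest-time option, here is a minimal Python sketch that stores the percentage as a plain field before the document is indexed, and builds a search body that sorts on it. The field names (`file_count`, `video_file_count`, `video_pct`) and both helper functions are hypothetical; the thread does not show the actual mapping or the scripted field.

```python
from typing import Any, Dict


def with_video_percentage(user_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of user_doc with the percentage stored as a plain field.

    Assumes hypothetical counters `file_count` and `video_file_count`.
    """
    total = user_doc.get("file_count", 0)
    videos = user_doc.get("video_file_count", 0)
    enriched = dict(user_doc)
    # Guard against users without files to avoid a division by zero.
    enriched["video_pct"] = round(100.0 * videos / total, 2) if total else 0.0
    return enriched


def top_users_body(size: int = 100) -> Dict[str, Any]:
    """Search body that returns the `size` users with the highest video_pct."""
    return {"size": size, "sort": [{"video_pct": {"order": "desc"}}]}


doc = with_video_percentage(
    {"user": "alice", "file_count": 8, "video_file_count": 2}
)
print(doc["video_pct"])  # 25.0
```

Once `video_pct` is a regular numeric field, the "top 100 users" question becomes an ordinary sorted search rather than something that depends on a scripted field.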

  • Is it better to use MongoDB for the data transformations?

I'm biased, so I'm not sure I'm the right person to say, but IMHO: no.

  • Where would you store the final JSON doc that contains the data used to display the information in Highcharts?

Elasticsearch, I guess. But again, it depends on the use case and on what your documents look like.

  • There are also the transactional issues to consider. The chart should be available all the time, even while the indexes are being recalculated. Is it true that, since the Highcharts data is stored in one single document in Elasticsearch, there will be no issues because Elasticsearch is transactional at the document level?

Use aliases. An alias is like a pointer to a concrete index. Reindex the data into a new index and then switch the alias to it when done.
When querying the data, query the alias and not the index. That should cover your need.
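The alias switch can be sketched as a request body for the Elasticsearch `_aliases` endpoint, where a `remove` and an `add` action are sent together so the alias moves in one step. This is only a sketch: the helper name and the alias and index names (`user_report`, `user_report_v1`, `user_report_v2`) are made up for illustration, and the body still has to be sent with whatever client you use.

```python
import json
from typing import Any, Dict


def alias_swap_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Build the body for POST /_aliases that repoints `alias`.

    Both actions travel in one request, so readers of the alias see
    either the old index or the new one, never a missing alias.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the chart keeps querying "user_report" throughout.
body = alias_swap_body("user_report", "user_report_v1", "user_report_v2")
print(json.dumps(body, indent=2))
```

The chart code only ever knows the alias name, so the freshly built index can be filled and checked before any reader sees it.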

