Best ETL tools for ingesting data from multiple data sources

Hi All,

I'm looking for some advice on ETL options for a project I'm currently working on. Our data sets generally cover things like people, vehicles, dates, times, and locations … This data can arrive in multiple formats (though mainly Oracle at the start):

Oracle, CSV, logs, JSON, etc.

We need to be able to modify this data on the fly (e.g. convert date formats, modify strings) before sending it to Elasticsearch.
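To illustrate the kind of on-the-fly transform I mean, here's a minimal Python sketch — the field names (`name`, `date`) and the DD/MM/YYYY input format are just assumptions for the example, not our actual schema:

```python
from datetime import datetime

def transform(record):
    """Normalise a raw record: reformat the date and tidy up a string field."""
    out = dict(record)
    # Convert e.g. "31/12/2023" into ISO 8601 "2023-12-31", which
    # Elasticsearch's default date mapping understands
    out["date"] = datetime.strptime(record["date"], "%d/%m/%Y").strftime("%Y-%m-%d")
    # Trim whitespace and normalise capitalisation
    out["name"] = record["name"].strip().title()
    return out

print(transform({"name": " john smith ", "date": "31/12/2023"}))
```

Each record would go through something like this before being indexed.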

So far we have looked at:

  • The Python client, though we could just as easily use the Ruby, PHP, or Perl clients
  • Logstash with grok (I also see there is a Ruby filter plugin for Logstash that would give me the ability to modify the data as it is ingested)
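For the Logstash route, a pipeline along these lines is what I have in mind — the connection string, query, field names, and date format below are placeholders, not a working config:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@//dbhost:1521/service"
    jdbc_driver_library    => "/path/to/ojdbc8.jar"
    jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_user              => "etl_user"
    statement              => "SELECT person_id, name, event_date FROM events"
    schedule               => "0 * * * *"   # jdbc input has built-in cron-style scheduling
  }
}
filter {
  ruby {
    code => "event.set('name', event.get('name').to_s.strip)"
  }
  date {
    match  => ["event_date", "dd/MM/yyyy"]
    target => "@timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "events"
  }
}
```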

My question: are there other, better tools available? And what is the advised best practice for this type of ETL process?

We will need Kerberos to authenticate to Oracle.
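For what it's worth, the Oracle side of Kerberos is usually driven by `sqlnet.ora` on the client; something like the fragment below is roughly what I expect we'd need (paths and the credential cache location are placeholders for our environment):

```
SQLNET.AUTHENTICATION_SERVICES = (KERBEROS5)
SQLNET.AUTHENTICATION_KERBEROS5_SERVICE = oracle
SQLNET.KERBEROS5_CONF = /etc/krb5.conf
SQLNET.KERBEROS5_CC_NAME = /tmp/krb5cc_1000
```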

Also, a tool we could use for scheduling would be useful; if not, we can always use cron, but that is a bit manual.
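By "use cron" I mean a crontab entry along these lines (script path and log location are just examples):

```
# Run the ETL script daily at 02:00 and append output to a log
0 2 * * * /usr/bin/python3 /opt/etl/pipeline.py >> /var/log/etl.log 2>&1
```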

Thanks very much for any help in advance

I guess it depends on what you mean by "better":

  • Configuring:
    • Graphical tools for pipeline design?
    • More powerful syntax vs simpler syntax?
    • Broader adoption / better documentation?
  • Operating:
    • Scalability?
    • Easier deployment of rule configurations + reference data?
    • Better runtime monitoring?

As a developer I tend to use Python for any ad-hoc loads due to its flexibility, but I don't run production pipelines requiring monitoring etc., so I can't comment on that side.
There are some nice-looking GUI-driven ETL tools like NiFi [1], but I've no idea about its Elasticsearch connectivity. Logstash will score more highly on that front.
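To give a flavour of the Python route: rows can be shaped into actions for the `helpers.bulk()` function in the official elasticsearch-py client with a small generator. The index name and `person_id` field below are made up for the example:

```python
def to_bulk_actions(rows, index="people"):
    """Yield bulk-API action dicts for elasticsearch-py's helpers.bulk()."""
    for row in rows:
        yield {
            "_index": index,           # target index
            "_id": row["person_id"],   # a stable id avoids duplicates on re-runs
            "_source": row,            # the document body
        }

actions = list(to_bulk_actions([{"person_id": 1, "name": "Ada"}]))
print(actions)
```

You'd then pass the generator to `helpers.bulk(Elasticsearch(...), actions)` — I've left the actual client call out since it needs a live cluster.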

[1] https://nifi.apache.org/

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.