Elasticsearch Data Transformation

(David Hughes) #1

I'm fairly well versed in ELK, but I haven't yet figured out the capabilities of the other tools that integrate with ES to transform its data. From what I've been able to research on my own, I believe 'transform' is the right term here. Let me give an example of what I'm looking to do.

I need to batch process our existing ES database, enriching the documents already there with supplementary data. Here is a simple example.

Let's say my log file data, stashed into ES, has two log lines - one where a message is sent, and (potentially) one where the same message is received. Each send/receive has a key which uniquely identifies the transaction.

What I'm looking to do is to post-process the ES database, match send/receive pairs, add transaction time into the receive record, and add a 'matched' boolean value to each indicating the transaction was successful.
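To make the matching step concrete, here is a minimal sketch in plain Python of the pairing logic, independent of whichever tool ends up running it. The field names (`id`, `event`, `txn_key`, `timestamp`) and the `transaction_time_ms` output field are assumptions made for illustration, not names from any real index. Marking unpaired records with `matched: false` is also an assumption about the desired behavior.

```python
from datetime import datetime, timedelta


def match_transactions(docs):
    """Pair send/receive documents by transaction key.

    Returns a dict mapping document id -> fields to add to that document.
    All field names here are hypothetical.
    """
    sends, receives = {}, {}
    for doc in docs:
        if doc["event"] == "send":
            sends[doc["txn_key"]] = doc
        elif doc["event"] == "receive":
            receives[doc["txn_key"]] = doc

    updates = {}
    for key, send in sends.items():
        recv = receives.get(key)
        if recv is None:
            # No receive record was found for this send.
            updates[send["id"]] = {"matched": False}
            continue
        sent_at = datetime.fromisoformat(send["timestamp"])
        received_at = datetime.fromisoformat(recv["timestamp"])
        # Integer milliseconds between send and receive.
        elapsed_ms = (received_at - sent_at) // timedelta(milliseconds=1)
        updates[send["id"]] = {"matched": True}
        updates[recv["id"]] = {"matched": True, "transaction_time_ms": elapsed_ms}

    # Receive records with no corresponding send are also flagged as unmatched.
    for key, recv in receives.items():
        if key not in sends:
            updates[recv["id"]] = {"matched": False}
    return updates


docs = [
    {"id": "a1", "event": "send", "txn_key": "k1", "timestamp": "2017-03-01T12:00:00.000"},
    {"id": "a2", "event": "receive", "txn_key": "k1", "timestamp": "2017-03-01T12:00:00.250"},
    {"id": "a3", "event": "send", "txn_key": "k2", "timestamp": "2017-03-01T12:00:01.000"},
]
updates = match_transactions(docs)
```

This version holds every record in memory, which is only reasonable for small batches. On a large index the same grouping by key would be done by the batch framework itself.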

There are other post-processing needs I've got, but this is a good example to get me going. Ideally I could script the transformation against documents returned from an ES query, and then update those documents by simply adding JSON content.
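As a rough illustration of "simply adding JSON content", the sketch below renders a set of extra fields as partial-document updates in the newline-delimited format the Elasticsearch Bulk API accepts. The index name `logs`, the document id, and the field names are hypothetical. Depending on the Elasticsearch version, the action line may also need a `_type`, so treat the exact metadata as an assumption to check against your cluster.

```python
import json


def to_bulk_updates(index, updates):
    """Render {doc_id: fields_to_add} as a Bulk API request body (NDJSON).

    Each document becomes two lines: an "update" action line, followed by
    a partial document holding only the fields to merge in.
    """
    lines = []
    for doc_id, fields in updates.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": fields}))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical example: add two fields to the document with id "a2".
body = to_bulk_updates("logs", {"a2": {"matched": True, "transaction_time_ms": 250}})
print(body)
```

The resulting string would be sent as the body of a `_bulk` request. Because these are partial updates, the existing fields of each document are left untouched.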

So my question is ... what is the best approach to accomplish this? Elasticsearch to/from Hadoop? Pig, Hive, etc ... all of these seem applicable, but I'm not sure where to start.

Any guidance on where to dive into further would be great!

(James Baiera) #2

Each of the tools that we support in ES-Hadoop is part of the greater Hadoop ecosystem for one of two reasons: it either provides a way to process data for users comfortable with a different toolset, or it provides a distinct processing model.

If you are most comfortable with writing SQL, then Hive and Spark are fairly good tools. Spark tends to get more traction because you can define your own custom functions in Java/Scala to run on the data with almost no hassle.

If you are looking to consume streaming data, Storm and Spark are two options we support right now. Do note though that ES isn't really built to be a streaming source, so we have limited to no support for streaming reads right now.

Based on downloads alone, Spark (batch/streaming + SQL) and Hive (batch-only SQL) are the biggest integrations right now, followed by MR (a more legacy API), Pig (custom batch DSL), Storm (streaming), and Cascading (Java DSL).
