I'm looking for the best storage format to stage data for elasticsearch ingestion. By best I mean the format that will require little or no Spark processing time. Ideally this Spark job would just load the data and push it directly to elasticsearch.
Why do I want a Spark job that just loads data into elasticsearch? I have a 3TB dataset and I'd like to load a small percentage of it multiple times so we can test different es configurations and performance. After we have a configuration we're pleased with we'll likely have a daily job that summarizes and pushes to elasticsearch.
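For context, the repeated test loads look roughly like this (the S3 path and sample fraction are illustrative, not my actual values):

```scala
// Illustrative sketch: read the summarized data and take a small sample
// for each elasticsearch configuration test. Path and fraction are
// placeholders.
val sample = sqlContext.read.parquet("s3://bucket/summaries/")
  .sample(withReplacement = false, fraction = 0.01)
```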
I currently have an aggregation job that converts raw events into summaries and writes them to S3 in Parquet format. I then have a second job that reads that data and transforms it into elasticsearch bulk format before pushing it with es-hadoop.
Example job that pushes to elasticsearch:
```scala
import org.elasticsearch.spark.rdd.EsSpark
import org.elasticsearch.spark.rdd.Metadata._

val dataframe = sqlContext.read.parquet(path)

val myDF = dataframe.select("device_id").distinct

// Parent documents: metadata only, empty source
val parentRDD = myDF.map( x => ( Map(ID -> x(0).toString.trim), Map() ) )

// Child documents: note the duplicate "foo" key here -- the second value
// overwrites the first, so only x(2) ends up in the document
val childrenRDD = dataframe.map( x => (
  Map(ID -> x(0), PARENT -> x(1)),
  Map("foo" -> x(0), "foo" -> x(2))
) )

EsSpark.saveToEsWithMeta(parentRDD, "development/parent")
EsSpark.saveToEsWithMeta(childrenRDD, "development/child")
```
My problem with this example is that the map transformations can take a considerable amount of time. Ideally I'd like to stage parentRDD and childrenRDD to file, but I'm not sure what the best format would be.
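One option I've considered (not sure it's the right answer, hence this question) is Spark's Java-serialized object files, which can round-trip the already-transformed pair RDDs so the load-only job skips the map step entirely. A sketch, assuming the es-hadoop Metadata enum serializes cleanly and with placeholder paths:

```scala
import org.elasticsearch.spark.rdd.{EsSpark, Metadata}

// Stage the transformed pair RDD once (path is a placeholder)
childrenRDD.saveAsObjectFile("s3://bucket/staged/children")

// Later, the load-only job: deserialize and push, no map step
val staged = sc.objectFile[(Map[Metadata, Any], Map[String, Any])](
  "s3://bucket/staged/children")
EsSpark.saveToEsWithMeta(staged, "development/child")
```

The downside I'd expect is that Java serialization is neither compact nor fast compared to Parquet, which is why I'm asking what format others use here.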
I like using es-hadoop for pushing to elasticsearch because of its configuration options for bulk size and retry policy, but I'd be open to using something else.