I'm trying to import a CSV that was exported from Oracle. The export aggregates similar items into a "count" column. I want to de-aggregate the data so that I can do time-based analysis on the counts in Elasticsearch.
I would like the "Count" column to represent a document count in ES, so that when I use arithmetic functions (SUM/AVG) over a time frame they use the count field. I will also be graphing this information in Kibana.
Having had a similar requirement in the past, I couldn't come up with a solution in Logstash or Elasticsearch alone: you need to 'flatten' your data before you move it into Elasticsearch via Logstash.
So what I did was write a simple Python script that reads the 'count' value and duplicates each line 'count' times into a new file (a sketch of that script is below). Of course you end up with massive input files, but you then have e.g. 993 documents matching action C with ID 2006221, all sharing the same timestamp, for date histograms in Kibana.
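Here's a minimal sketch of that flattening step. It assumes the CSV has a header row and a column literally named "count"; the file names and column name are placeholders you'd adapt to your export.

```python
import csv

# Read the Oracle export and write each row 'count' times, dropping the count column,
# so every occurrence becomes its own line (and later its own document).
with open('oracle_export.csv', 'r') as infile, open('flattened.csv', 'w') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = [f for f in reader.fieldnames if f != 'count']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        count = int(row.pop('count'))
        for _ in range(count):
            writer.writerow(row)
```

The flattened file can then go through Logstash's csv filter as usual.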
edit: Now that I think about it, you can just use the Python API for Elasticsearch: read the count value and index that line 'count' times directly into Elasticsearch, negating the need for huge input files. I don't know why I didn't think of that at the time.
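Something along these lines, using the elasticsearch Python client; the index name, host, and "count" column are assumptions to adapt (and older client versions may also want a doc_type argument):

```python
import csv
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

with open('oracle_export.csv', 'r') as infile:
    for row in csv.DictReader(infile):
        count = int(row.pop('count'))
        # Index the same row 'count' times so each occurrence is its own document
        for _ in range(count):
            es.index(index='oracle-data', body=row)
```

For large counts, the client's bulk helper would be considerably faster than one request per document, but the idea is the same.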
edit: You don't even need the Elasticsearch Python API. You can import csv, json and urllib2, serialize each CSV row to JSON, and use urllib2 to POST it to Elasticsearch on port 9200 'count' times.
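A Python 2 sketch of that standard-library-only version; the URL (index/type) and the "count" column name are assumptions:

```python
import csv
import json
import urllib2

url = 'http://localhost:9200/oracle-data/event'

with open('oracle_export.csv', 'r') as infile:
    for row in csv.DictReader(infile):
        count = int(row.pop('count'))
        payload = json.dumps(row)
        # POST the serialized row 'count' times so each occurrence becomes a document
        for _ in range(count):
            req = urllib2.Request(url, payload, {'Content-Type': 'application/json'})
            urllib2.urlopen(req)
```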