Import aggregated data

Hi All,

I'm trying to import a CSV that was exported out of Oracle. The export aggregates similar items into a "count" column. I want to de-aggregate the data so that in Elasticsearch I can do time-based analysis on the counts.

The CSV looks like:

 "DATE","ID","ACTION","COUNT"
 01/01/2018 03:10:00,2006221,"C",993
 01/01/2018 08:30:00,2038003,"C",1
 01/01/2018 11:30:00,2002720,"C",73839
 01/01/2018 15:10:00,8263,"C",394
 01/01/2018 23:10:00,12139,"C",53
 01/02/2018 00:10:00,2,"A",3080
 01/02/2018 22:20:00,2012601,"C",5
 01/03/2018 06:00:00,2002720,"C",76113
 01/03/2018 08:40:00,11948,"C",1

I would like the "COUNT" column to represent a document count in ES, so that when I use arithmetic functions (SUM/AVG) over a time frame they operate on the count field. I will also be graphing this information in Kibana.

Here is my current Logstash config file:

input {
  file {
    path => "/home/logstash/test.csv"
    type => "audit"
    # Read the whole file from the top instead of tailing for new lines
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}


filter {
  csv {
    # Match the CSV header exactly so skip_header can recognize it
    columns => ["DATE", "ID", "ACTION", "COUNT"]
    skip_header => true
    # Make COUNT numeric so SUM/AVG work on it
    convert => { "COUNT" => "integer" }
  }
  date {
    # MM/dd/yyyy (yyyy, not week-year YYYY) matches e.g. 01/01/2018 03:10:00
    match => ["DATE", "MM/dd/yyyy HH:mm:ss"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    action => "index"
    hosts => "localhost"
    index => "base-test"
  }
}

Having had a similar requirement in the past, I couldn't come up with a solution via Logstash or Elasticsearch alone. You need to 'flatten' your data before you move it into Elasticsearch via Logstash.

So what I did was write a simple Python script that read the 'count' value and duplicated the line 'count' times into a new file (see the sketch below). You do end up with massive input files, but at least you'll then have, e.g., 993 documents matching action C with ID 2006221, all with the same timestamp, for date histograms in Kibana.
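Roughly like this (a minimal sketch, assuming Python 3 and the column names from the sample above; the file names are placeholders):

import csv

# Read the aggregated CSV and write each row out COUNT times,
# dropping the COUNT column, so every event becomes one document.
with open("test.csv", newline="") as src, \
     open("test_flat.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["DATE", "ID", "ACTION"])
    for row in reader:
        for _ in range(int(row["COUNT"])):
            writer.writerow([row["DATE"], row["ID"], row["ACTION"]])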

edit: Now that I think about it, you can just use the Python API for Elasticsearch: read the count value and write that line 'count' times straight into Elasticsearch, negating the need for huge input files. I don't know why I didn't think of that at the time.
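Something along these lines with the elasticsearch-py client (a sketch; the host, index name, and file path are assumptions, and the bulk helper keeps the 73839-count rows from taking one HTTP round trip each; older 6.x clusters may also want a "_type" on each action):

import csv
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])  # host is an assumption

def docs():
    # Yield each CSV row COUNT times as an individual document.
    with open("test.csv", newline="") as src:
        for row in csv.DictReader(src):
            doc = {"DATE": row["DATE"], "ID": row["ID"], "ACTION": row["ACTION"]}
            for _ in range(int(row["COUNT"])):
                yield {"_index": "base-test", "_source": doc}

bulk(es, docs())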

I'll give that a try. Now to dust off my Python skills.

Shouldn't be too hard; it's actually very simple. I'd just give you my code, but I have no idea where it is.

https://docs.python.org/3/library/csv.html
https://elasticsearch-py.readthedocs.io/en/master/

edit: You don't even need the Elasticsearch Python API either. You can import csv, import urllib2, and import json: serialize each CSV row to JSON and use urllib2 to POST it to Elasticsearch on port 9200 'count' times.
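Note that urllib2 is Python 2; in Python 3 (the docs linked above) the equivalent is urllib.request. A sketch of that variant, where the URL and index/endpoint path are assumptions:

import csv
import json
import urllib.request

URL = "http://localhost:9200/base-test/_doc"  # host and index are assumptions

with open("test.csv", newline="") as src:
    for row in csv.DictReader(src):
        doc = {"DATE": row["DATE"], "ID": row["ID"], "ACTION": row["ACTION"]}
        body = json.dumps(doc).encode("utf-8")
        # POST the same document COUNT times, one request per document.
        for _ in range(int(row["COUNT"])):
            req = urllib.request.Request(
                URL, data=body,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)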
