Number or date type for time series data?


(Yong Wang) #1

Hi, I am using Elasticsearch to manage a lot of time series data, about 500 GB (2000M events) per day. The mapping looks like this:
"mappings" : {
  "default" : {
    "_all" : { "enabled" : false },
    "properties" : {
      "@version": { "index": "analyzed", "type": "integer" },
      "@timestamp": { "index": "analyzed", "type": "date" },
      "date_time": { "index": "not_analyzed", "type": "integer" },
      "netflow": {
        "dynamic": true,
        "type": "object",
        "properties": {
          "version": { "index": "analyzed", "type": "integer" },
          "flow_seq_num": { "index": "not_analyzed", "type": "long" },
          "engine_type": { "index": "not_analyzed", "type": "integer" },
          "engine_id": { "index": "not_analyzed", "type": "integer" },
          "sampling_algorithm": { "index": "not_analyzed", "type": "integer" },
          "sampling_interval": { "index": "not_analyzed", "type": "integer" },
          "flow_records": { "index": "not_analyzed", "type": "integer" },
          "ipv4_src_addr": { "index": "analyzed", "type": "ip" },
          "ipv4_dst_addr": { "index": "analyzed", "type": "ip" },
          "ipv4_next_hop": { "index": "analyzed", "type": "ip" },
          "input_snmp": { "index": "not_analyzed", "type": "long" },
          "output_snmp": { "index": "not_analyzed", "type": "long" },
          "in_pkts": { "index": "analyzed", "type": "long" },
          "in_bytes": { "index": "analyzed", "type": "long" },
          "first_switched": { "index": "not_analyzed", "type": "date" },
          "last_switched": { "index": "not_analyzed", "type": "date" },
          "l4_src_port": { "index": "analyzed", "type": "long" },
          "l4_dst_port": { "index": "analyzed", "type": "long" },
          "tcp_flags": { "index": "analyzed", "type": "integer" },
          "protocol": { "index": "analyzed", "type": "integer" },
          "src_tos": { "index": "analyzed", "type": "integer" },
          "src_as": { "index": "analyzed", "type": "integer" },
          "dst_as": { "index": "analyzed", "type": "integer" },
          "src_mask": { "index": "analyzed", "type": "integer" },
          "dst_mask": { "index": "analyzed", "type": "integer" }
        }
      }
    }
  }
}

The "date_time" field is the Unix timestamp of "@timestamp". The typical search is a histogram on date_time, for example the total event count or the sum of in_bytes for every minute between 9:00 and 10:00.
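
For illustration, the two variants of that search could be built roughly like this. It is only a sketch: the helper name, the aggregation names (per_minute, total_in_bytes) and the example times are made up, date_time is assumed to hold Unix seconds, and newer Elasticsearch releases spell the date_histogram parameter "fixed_interval" rather than "interval".

```python
import json


def minute_histogram_body(start, end, use_date_field=True):
    """Build a search body that buckets events per minute and sums in_bytes.

    Hypothetical helper: it only assembles the request body as a dict and
    does not talk to a cluster.
    """
    if use_date_field:
        # date_histogram on the date-typed field, one bucket per minute
        bucket = {"date_histogram": {"field": "@timestamp", "interval": "1m"}}
        time_field = "@timestamp"
    else:
        # plain histogram on the integer field; 60 assumes Unix *seconds*
        bucket = {"histogram": {"field": "date_time", "interval": 60}}
        time_field = "date_time"

    # per-bucket sum of bytes; the doc count of each bucket is the event count
    bucket["aggs"] = {"total_in_bytes": {"sum": {"field": "netflow.in_bytes"}}}

    return {
        "size": 0,  # skip the hits, only the aggregation is needed
        "query": {"range": {time_field: {"gte": start, "lt": end}}},
        "aggs": {"per_minute": bucket},
    }


# Example times are placeholders for "between 9:00 and 10:00".
body = minute_histogram_body("2015-06-01T09:00:00Z", "2015-06-01T10:00:00Z")
print(json.dumps(body, indent=2))
```

Passing use_date_field=False with integer bounds gives the date_time variant, so both requests can be timed against the same index.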

So my question is: is there any significant performance difference between aggregating on date_time and on @timestamp?

Thanks.


(Mark Walkom) #2

Doing it on date_time will probably give you a lot of buckets, as the values are more than likely unique.
But if you aggregate on minute-resolution timestamps you'll get far fewer.

I'd expect the latter to be more efficient.
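
To make the bucket-count point concrete, here is a small stand-alone sketch. It does not touch Elasticsearch; the event data is synthetic (two events per second for one hour) and the helper name is made up.

```python
from collections import Counter


def bucket_counts(timestamps, interval_seconds):
    """Count events per bucket, keyed by the bucket's start time."""
    return Counter(ts - ts % interval_seconds for ts in timestamps)


# Synthetic data: two events for every second of one hour (Unix seconds).
start = 1433149200
events = [start + s for s in range(3600) for _ in range(2)]

per_second = bucket_counts(events, 1)   # one bucket per distinct value
per_minute = bucket_counts(events, 60)  # values rounded down to the minute

print(len(per_second), len(per_minute))  # 3600 60
```

The same events land in 3600 buckets at second resolution and in 60 at minute resolution, which is the difference described above.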

