Hive queries against time-based indices

(Horst Birne) #1

Hi guys,

I set up a development Hadoop cluster with the Hive infrastructure and the elasticsearch-hadoop connector to allow SQL-like queries on ES data.

Everything works quite well, joins included, and we are thinking of using it in production, but we ran into a nasty problem:

We use (like many other users out there) time-based indices for log data in Elasticsearch. To improve the user experience, it would be ideal not to have to create the Hive metastore tables with static indices, but rather with something like this:

CREATE EXTERNAL TABLE dynamic (logsource STRING, bytes BIGINT, src STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'logstash-{@timestamp:YYYY.MM.dd}/unix',
'es.nodes' = 'esnode:9200',
'es.query' = '?q=@timestamp:[2015-12-08T09:33Z TO 2015-12-08T09:35Z]') ;

This kind of pattern is already possible when writing data to Elasticsearch. For reading, the query string would supply the information Hadoop/Elasticsearch needs to choose the right indices.

Using the _all index is of course possible, but given the tiered data setup of most users, this is very inefficient.

Are there any suggestions or workarounds for this?

Thanks for any input

(Costin Leau) #2

For reading it's a catch-22. The index is not known until the results are returned, so how is this supposed to work? Run the query against all indices that match the pattern (basically logstash-*), get the results but don't stream them, just identify the indices, and then rerun the query, this time in a distributed manner?

I'm all for convenience, but for ES-Hadoop to work it needs to know which indices to run its query against so it can discover the shards and distribute the work across them. It can't do that beforehand if that information depends on the query.

If you have a pattern (daily, hourly, last-30' indices), why not automatically create an alias for it and run the job against that?
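
As a rough sketch of the alias idea (not something ES-Hadoop does for you), a small script run from cron could work out the daily index names for a time window and build the request body for the Elasticsearch `_aliases` endpoint. The alias name `logstash-recent`, the `logstash-` prefix default and both helper functions are assumptions made up for illustration; only the index naming scheme is taken from the table definition above.

```python
import json
from datetime import date, timedelta


def daily_indices(start, end, prefix="logstash-"):
    """Return one index name per day from start to end, inclusive.

    The prefix and the YYYY.MM.dd suffix follow the logstash naming
    used in this thread; adjust both if your indices differ.
    """
    days = (end - start).days
    return [
        prefix + (start + timedelta(days=n)).strftime("%Y.%m.%d")
        for n in range(days + 1)
    ]


def alias_actions(alias, indices):
    """Build a body for POST /_aliases that adds every index to the alias.

    'alias' is a hypothetical name chosen by the caller.
    """
    return {
        "actions": [{"add": {"index": name, "alias": alias}} for name in indices]
    }


if __name__ == "__main__":
    # Example window: two days around the query in the first post.
    indices = daily_indices(date(2015, 12, 7), date(2015, 12, 8))
    body = alias_actions("logstash-recent", indices)
    # Send this JSON to http://esnode:9200/_aliases with any HTTP client.
    print(json.dumps(body, indent=2))
```

The Hive table would then be created once with 'es.resource' = 'logstash-recent/unix', and the time filter stays in 'es.query'. A real job would presumably also issue "remove" actions for indices that have dropped out of the window.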

(Horst Birne) #3


I understand that this is probably a very specialized use case for Hive/es-hadoop, and that this dynamic pattern is very likely not something that can be handled on the es-connector side.

Anyway, thanks for your suggestion about index aliases; that is something I hadn't thought about.
