I am currently using Elasticsearch for realtime log analysis and,
unfortunately, I am running into performance issues. To resolve them, I'd
like to try custom routing on the timestamp, because our analysis focuses on
recent windows such as the last 15 minutes, the last hour, or the last 4
hours. Is it possible to shard based on a time range? If that is not
supported yet, what would be a good starting point for implementing custom
routing logic?
My second question: as far as I understand, Elasticsearch routing places all
records with the same routing id in the same shard. If the data has a skewed
distribution on the routing field, does Elasticsearch still keep the shards
balanced across the cluster?
In this case you are better off using routing based on the date value. You
can route at index time or at query time on any value you want; just add the
routing parameter to the request, like this:
curl -XPUT 'http://127.0.0.1:9200/index/type/id?routing=your_value'
You should build a custom routing value from the day, and maybe add the hour
if you have a lot of logs (routing=2012100912). All documents indexed with
the same routing value are routed to the same shard. Use the same logic when
you query ES.
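Keep in mind that a routing value always maps to exactly one shard, so a
very busy hour stays on that single shard; Elasticsearch does not split it
across nodes. The benefit is on the query side: a search sent with the same
routing values only hits the shards that hold them, not all the shards.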
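
To make the "same logic at index and query time" part concrete, here is a
rough Python sketch that only builds routing values and request URLs; it does
not talk to a cluster. The helper names (routing_key, routing_keys_for_window,
index_url, search_url) are made up for illustration, the hourly YYYYMMDDHH
format is the one from the example above, and it assumes timestamps are in
the same time zone when indexing and when searching. The search URL passes
several routing values as a comma-separated list.

```python
from datetime import datetime, timedelta
from typing import List
from urllib.parse import urlencode

# Hourly bucket format, e.g. 2012100912 (assumption: one routing value per hour).
ROUTING_FORMAT = "%Y%m%d%H"


def routing_key(ts: datetime) -> str:
    """Return the hourly routing value for a timestamp."""
    return ts.strftime(ROUTING_FORMAT)


def routing_keys_for_window(end: datetime, hours: int) -> List[str]:
    """Return every hourly routing value touched by the last `hours` hours."""
    start = end - timedelta(hours=hours)
    # Snap to the beginning of the first hour bucket in the window.
    cursor = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while cursor <= end:
        keys.append(routing_key(cursor))
        cursor += timedelta(hours=1)
    return keys


def index_url(base: str, index: str, doc_type: str, doc_id: str, ts: datetime) -> str:
    """Build the URL for indexing one document with its hourly routing value."""
    query = urlencode({"routing": routing_key(ts)})
    return f"{base}/{index}/{doc_type}/{doc_id}?{query}"


def search_url(base: str, index: str, end: datetime, hours: int) -> str:
    """Build a search URL restricted to the routing values of the time window."""
    keys = ",".join(routing_keys_for_window(end, hours))
    # Keep commas readable instead of percent-encoding them.
    query = urlencode({"routing": keys}, safe=",")
    return f"{base}/{index}/_search?{query}"
```

A "last 15 minutes" query that crosses an hour boundary needs two routing
values, which is why the window helper returns a list instead of a single
key.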