We have been integrating an Elasticsearch log writer into the Bro network
monitor (http://www.bro-ids.org), and we have a few users who are
monitoring extremely high-volume networks and want to insert their logs
into Elasticsearch, but their logging rate will hover around 40k-50k
documents per second for relatively long periods of time. We are already
doing index rotation, which has been nice for expiring old data and for
searching constrained time periods, but I suspect there is more we
could or should be doing.
Are there any tuning guides available, or techniques we could be using,
to insert documents at high rates?
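Not an official guide, but the usual first steps for sustained high-rate loads are batching writes through the bulk API, raising the index refresh interval, and dropping replicas while loading. The bulk endpoint expects newline-delimited JSON: one action line per document followed by its source, with a trailing newline. A minimal Python sketch of building such a body (the index name and Bro-style field names here are invented for illustration):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited JSON body for the Elasticsearch _bulk API:
    one {"index": ...} action line followed by one source line per document.
    The bulk endpoint requires the body to end with a newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Hypothetical Bro conn-log documents; POST the resulting string to /_bulk.
body = build_bulk_body("bro-2012.08.30", "conn",
                       [{"orig_h": "10.0.0.1", "resp_p": 80},
                        {"orig_h": "10.0.0.2", "resp_p": 443}])
print(body)
```

POSTing a body like that with a few thousand documents per request is generally far cheaper than indexing documents one HTTP call at a time.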
Also, what I have found is that often two or three machines in the cluster
are really beefy, while the machines holding old logs are less beefy. In
that case, you can use index shard allocation and dynamic relocation:
make sure the "current" index is on the beefy machines, and move the old
indices to the less beefy machines.
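That split can be expressed with shard allocation filtering: tag nodes with a custom attribute and steer indices with index-level routing settings. A sketch, assuming the attribute name `box_type` and the index name (both are arbitrary choices for illustration):

```shell
# elasticsearch.yml on the beefy nodes:   node.box_type: strong
# elasticsearch.yml on the older nodes:   node.box_type: weak

# Pin the current index to the beefy machines:
curl -XPUT 'http://localhost:9200/bro-2012.08.30/_settings' -d '
{ "index.routing.allocation.include.box_type": "strong" }'

# Later, retire it to the weaker machines; Elasticsearch relocates
# the shards automatically:
curl -XPUT 'http://localhost:9200/bro-2012.08.30/_settings' -d '
{ "index.routing.allocation.include.box_type": "weak" }'
```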
I would also be interested in this information about dynamic shard
relocation. I have been working on tuning for massive insert volumes
lately as well, with 40-50k/s sustained required. I have finally gotten
things tuned well; the primary change was an allocation approach that
spreads each index as widely as possible around the cluster. I have
written an alternate shard allocator based on this approach and will be
submitting a pull request in the next day or two, after I finish writing
my test cases.
The primary problem I had up until this point was that if I had to
restart a node or two, some indices ended up bunched on a small number
of nodes, causing performance issues.
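A toy sketch of the "spread each index wide" idea (this is not the poster's actual allocator, just an illustration of the placement policy): assign each index's shards round-robin over the nodes, rotating the starting node per index so no node accumulates a disproportionate share of any one index.

```python
def spread_allocation(indices, nodes):
    """Toy shard allocator: place each index's shards round-robin across
    all nodes, rotating the starting node per index so every index is
    spread as evenly as possible instead of bunching on a few nodes.

    indices: dict of index name -> shard count
    nodes:   list of node names
    Returns a dict of (index, shard_number) -> node."""
    assignment = {}
    for offset, (index, shard_count) in enumerate(sorted(indices.items())):
        for shard in range(shard_count):
            assignment[(index, shard)] = nodes[(offset + shard) % len(nodes)]
    return assignment

# Three daily indices with 5 shards each over 4 nodes: every index touches
# all 4 nodes, and no node holds more than 2 shards of a single index.
alloc = spread_allocation(
    {"bro-2012.08.28": 5, "bro-2012.08.29": 5, "bro-2012.08.30": 5},
    ["node1", "node2", "node3", "node4"])
```

This only models the placement policy; a real allocator also has to handle replicas, node restarts, and rebalancing.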
On Thursday, August 30, 2012 4:10:14 AM UTC-4, Filirom1 wrote:
Also, what I have found is that often two or three machines in the cluster
are really beefy, while the machines holding old logs are less beefy. In
that case, you can use index shard allocation and dynamic relocation:
make sure the "current" index is on the beefy machines, and move the old
indices to the less beefy machines.
On Aug 29, 2012, at 6:19 PM, Seth Hall <seth...@gmail.com> wrote:
Hi all,
We have been integrating an Elasticsearch log writer into the Bro network
monitor (http://www.bro-ids.org), and we have a few users who are
monitoring extremely high-volume networks and want to insert their logs
into Elasticsearch, but their logging rate will hover around 40k-50k
documents per second for relatively long periods of time. We are already
doing index rotation, which has been nice for expiring old data and for
searching constrained time periods, but I suspect there is more we
could or should be doing.
Are there any tuning guides available, or techniques we could be using,
to insert documents at high rates?