Rotating indices every few hours - output to elasticsearch


(Jeff Elliott) #1

Hi folks,

I'm running into an issue with Elasticsearch 1.5: I get heap memory circuit breakers if I use daily indices, but query buffer overrun if I use hourly.

So I'd like to split the difference to get a little more performance. However, the Logstash output configuration doesn't offer an obvious way to name indices at a granularity between daily and hourly.

I've got a solution that uses if / else if logic to set an added field based on the period of the day (i.e., period 01 covers 00:00 to 06:00, and so on), and then names the index using that field. Is there a cleverer way to do this?

Here's an example. I'd appreciate feedback on performance, better ways to accomplish this, etc. For simplicity of demonstration, I'm varying the path of a file output instead, and the periods are 20 seconds:

# Sample which breaks logs into files based on 20-second increments
input {
  stdin {
    type => "stdin-type"
    # Joda "s" is second-of-minute; used here only to pick a period
    add_field => { "index_help" => "%{+s}" }
  }
}

filter {
  mutate {
    convert => { "index_help" => "integer" }
  }
  if [index_help] < 20 {
    mutate {
      add_field => { "index_period" => "P01" }
    }
  } else if [index_help] < 40 {
    mutate {
      add_field => { "index_period" => "P02" }
    }
  } else {
    mutate {
      add_field => { "index_period" => "P03" }
    }
  }
}

output {
  file {
    path => "logstash-%{+YYYY-MM-dd.HH-mm}-%{index_period}.txt"
  }
}
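For the real use case, the same pattern can feed an elasticsearch output instead of a file. A sketch of the 6-hour variant, assuming the hour of day is pulled with the Joda pattern %{+HH} (the field names and index prefix here are illustrative, not from a tested config):

```conf
filter {
  # Hour of day (00-23), used only to bucket events into four 6-hour periods
  mutate { add_field => { "index_hour" => "%{+HH}" } }
  mutate { convert => { "index_hour" => "integer" } }
  if [index_hour] < 6 {
    mutate { add_field => { "index_period" => "P01" } }
  } else if [index_hour] < 12 {
    mutate { add_field => { "index_period" => "P02" } }
  } else if [index_hour] < 18 {
    mutate { add_field => { "index_period" => "P03" } }
  } else {
    mutate { add_field => { "index_period" => "P04" } }
  }
}

output {
  elasticsearch {
    index => "logstash-%{+YYYY.MM.dd}-%{index_period}"
  }
}
```

This yields four indices per day instead of 24, without changing anything else about the pipeline.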

Thanks,
Jeff


(Aaron Mildenstein) #2

I don't know your data set here, so please understand these are general recommendations based on experience. Your mileage may vary based on what you're querying, and how much data at once.

This is indicative of memory pressure. Resegmenting your indices to try to avoid memory pressure is not the best way to handle this.

I get heap memory circuit breakers if I use daily indices

This indicates that cluster memory is insufficient for the size of queries you intend to run. There are several ways to deal with this. If your queries are against not_analyzed strings or numeric values, use doc_values in your template, if you haven't already. If your queries are against analyzed strings, or you still see circuit breakers tripped even with doc_values set, then your best recourse is to increase the amount of cluster memory available, either by increasing the heap on each node (up to a maximum of 30.5G) or by increasing the number of nodes.
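In ES 1.x, doc_values are enabled per field in the index template's mapping. A minimal sketch of what that looks like (the template name and field are illustrative, not taken from this thread):

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "clientip": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

With doc_values, fielddata for that field lives on disk rather than in the heap the circuit breaker is guarding.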

One way to improve memory usage without adding hardware is to map your fields manually, using the smallest data type possible. Left to guess, Elasticsearch picks the biggest possible type: a floating-point value sent through Logstash becomes a double, and all integers become long, unless you map them otherwise. Properly mapping your values also helps avoid circuit breakers, because a query aggregating a short doesn't need to allocate as much memory as one aggregating a long.
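As an illustration, Apache access logs carry numeric fields that never need 64 bits; a hypothetical mapping fragment might pin them down like this (field names are assumptions):

```json
{
  "properties": {
    "response": { "type": "short" },
    "bytes":    { "type": "integer" }
  }
}
```

A short is 2 bytes against a long's 8, so fielddata for that field costs roughly a quarter of the memory, which is exactly the saving described above.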

Since you're using Logstash, I recommend upgrading to the most recent version, if only for the upgraded default template it ships with, which applies doc_values by default wherever possible.

... query buffer overrun if I use hourly

This is likely because hourly indices add so many shards that your cluster has trouble managing them all, or querying across all of them at once. Just because Elasticsearch doesn't immediately complain about the shard count doesn't mean that too many won't cause issues. Ideally, to have no shard pressure, you should have no more than 400-700 total shards on a 30G node, with only a few of those active (currently indexing).

Memory pressure counts here too. Each active shard wants 250M of the 10% of the heap allocated to the index buffer (indices.memory.index_buffer_size, if you want to customize the amount set aside from the heap); inactive shards each want 4M of that 10%. I say want, because if there are too many shards on a node, Elasticsearch will reduce the amount available to each shard to allow them all to fit.

On a node with a 30G heap, that means 3G is allocated by default. If all shards were inactive, that would be a safe fit for 774 shards on that node, but any active shards reduce that count quickly. This isn't to say you can't have a box with 1000 shards, but more often than not, we see indexing and search performance problems on boxes with high shard counts.
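The settings referenced above live in elasticsearch.yml. The defaults described are equivalent to the following (shown only for illustration, not as a recommendation to change them):

```yaml
# ES 1.x defaults (sketch): 10% of heap shared by all shards' index buffers,
# with a 4mb minimum per shard
indices.memory.index_buffer_size: 10%
indices.memory.min_shard_index_buffer_size: 4mb
```

On a 30G heap, that 10% is the 3G figure used in the arithmetic above.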

Again, these are general recommendations I offer as the things you have shared indicate memory pressure. And the best way to fix memory pressure is not to subdivide your indices in ever more complex ways, as that only masks the problem and adds complexity to your queries.


(Jeff Elliott) #3

Hi,

Thanks for the comprehensive feedback. I'm attacking the problem this way because I've been frustrated in the other directions.

I'm running two nodes with ~32GB of RAM apiece, and the default heap allocation.

I've used doc_values where possible, and I'll certainly look into updating Logstash. I'm using Amazon Elasticsearch Service (for now), which restricts which versions I can run and severely limits what configuration I can do.

In this topic I lay out some of my difficulty with heap size; the tl;dr is that I'm running into a GC problem (I think) with the larger full-day indices. That is, the same number of documents (200M or so, mostly Apache access logs) fails in daily indices where hourly indices are OK.

I'm aware of why the query queue is running out; that's why I'm trying to find middle ground. Apart from throwing more nodes at the problem, hourly indices yield 24 x 5 x (days) shards. A Kibana dashboard with, say, 6 widgets is into thousands of queries right away. (As an aside, Kibana really isn't helping by refreshing all the widgets simultaneously.)

Luckily I'm only active on one or two indices, typically. And I'm managing by limiting the number of days of data I keep before archiving & deleting. Ideally, I'd like to keep the data longer, and feed more data into the indices each day. Right now I'm doing ~14-20M docs a day, and I wouldn't mind doubling that, but I've got to get through these growing pains, first.

Seems like each time I approach 150M-250M documents, it becomes very hard to convince ES to work well with my limitations.

So, to come full circle: with 6-hour or 4-hour indices I hope to find a middle ground where I avoid the fielddata circuit breaker but cut my shard count by a factor of 4 or 6.

Thanks,
Jeff


(Aaron Mildenstein) #4

This is another reason why growing your cluster would serve you better.

Agreed. For your use case, it seems that Amazon Elasticsearch Service is not the best fit. You'd probably be better served hosting your own cluster on Amazon and managing it the way you want. You'd be able to tune it better, to say nothing of being able to use the latest versions of Elasticsearch.

I agree that reducing shard count will have a palliative effect, but it's a band-aid approach to the problem for the reasons I outlined in my initial response. It should help, but it's not the best solution.


(system) #5